type: Post
status: Published
date: Apr 3, 2025
slug: 2025/04/03/Monitoring and Tuning the Linux Networking Stack: Receiving Data | Packagecloud Blog
tags: Linux
category: Linux
Created_time: Mar 6, 2025 02:41 AM
Last edited time: Apr 3, 2025 06:49 AM
TL;DR(总结)
This blog post explains how computers running the Linux kernel receive packets, as well as how to monitor and tune each component of the networking stack as packets flow from the network toward userland programs.
本文将介绍运行 Linux 内核的计算机如何接收数据包,以及在数据包从网络流向用户态程序的过程中,如何监控和调整网络栈的各个组件。
UPDATE We’ve released the counterpart to this post: Monitoring and Tuning the Linux Networking Stack: Sending Data.
我们发布了本文的姊妹篇:监控和调优 Linux 网络栈:数据发送。
UPDATE Take a look at the Illustrated Guide to Monitoring and Tuning the Linux Networking Stack: Receiving Data, which adds some diagrams for the information presented below.
查看监控和调优 Linux 网络栈:数据接收的图文指南,其中为以下内容添加了一些图表。
It is impossible to tune or monitor the Linux networking stack without reading the source code of the kernel and having a deep understanding of what exactly is happening.
This blog post will hopefully serve as a reference to anyone looking to do this.
如果不阅读内核源代码并深入理解具体发生的情况,就无法对 Linux 网络栈进行监控或调优。希望这篇文章能为任何想要进行此项工作的人提供参考。
Special thanks(特别感谢)
Special thanks to the folks at Private Internet Access who hired us to research this information in conjunction with other network research and who have graciously allowed us to build upon the research and publish this information.
特别感谢 Private Internet Access 的工作人员,他们聘请我们进行此项信息研究,以及其他网络研究,并慷慨地允许我们在此研究基础上进行拓展,并发布这些信息。
The information presented here builds upon the work done for Private Internet Access, which was originally published as a 5 part series starting here.
本文中的信息是在为 Private Internet Access 所做工作的基础上构建的,该工作最初以五部分系列文章的形式发布,可从此处开始阅读。
General advice on monitoring and tuning the Linux networking stack(监控和调优 Linux 网络栈的一般建议)
The networking stack is complex and there is no one size fits all solution. If the performance and health of your networking is critical to you or your business, you will have no choice but to invest a considerable amount of time, effort, and money into understanding how the various parts of the system interact.
网络栈非常复杂,没有一种通用的解决方案。如果网络性能和健康状况对您或您的企业至关重要,那么您别无选择,只能投入大量的时间、精力和资金,来了解系统各个部分是如何相互作用的。
Ideally, you should consider measuring packet drops at each layer of the network stack. That way you can determine and narrow down which component needs to be tuned.
理想情况下,您应该考虑在网络栈的每一层测量数据包丢失情况。这样,您就可以确定并缩小需要调整的组件范围。
This is where, I think, many operators go off track: the assumption is made that a set of sysctl settings or /proc values can simply be reused wholesale. In some cases, perhaps, but it turns out that the entire system is so nuanced and intertwined that if you desire to have meaningful monitoring or tuning, you must strive to understand how the system functions at a deep level. Otherwise, you can simply use the default settings, which should be good enough until further optimization (and the required investment to deduce those settings) is necessary.
我认为,许多运维人员在此处误入歧途:他们假设可以直接复用一组 sysctl 设置或 /proc 值。在某些情况下,也许可行,但整个系统实际上非常微妙且相互交织,如果您希望进行有意义的监控或调优,就必须努力深入了解系统的运行机制。否则,您可以直接使用默认设置,在需要进一步优化(以及为推导这些设置所需的投入)之前,默认设置通常已经足够好了。
Many of the example settings provided in this blog post are used solely for illustrative purposes and are not a recommendation for or against a certain configuration or default setting. Before adjusting any setting, you should develop a frame of reference around what you need to be monitoring to notice a meaningful change.
本文中提供的许多示例设置仅用于说明目的,并非对特定配置或默认设置的推荐或反对。在调整任何设置之前,您应该围绕需要监控的内容建立一个参考框架,以便注意到有意义的变化。
Adjusting networking settings while connected to the machine over a network is dangerous; you could very easily lock yourself out or completely take out your networking. Do not adjust these settings on production machines; instead make adjustments on new machines and rotate them into production, if possible.
通过网络连接到机器时调整网络设置是很危险的,您很可能会将自己锁定在外,或者完全中断网络连接。请勿在生产机器上调整这些设置;如果可能的话,应在新机器上进行调整,然后将其轮换到生产环境中。
Overview(概述)
For reference, you may want to have a copy of the device data sheet handy. This post will examine the Intel I350 Ethernet controller, controlled by the igb device driver. You can find that data sheet (warning: LARGE PDF) here for your reference.
为便于参考,您可能需要手头备有一份设备数据表。本文将研究由 igb 设备驱动程序控制的英特尔 I350 以太网控制器。您可以在此处找到该数据表(警告:大型 PDF 文件)以供参考。
The high level path a packet takes from arrival to socket receive buffer is as follows:
- Driver is loaded and initialized.
- Packet arrives at the NIC from the network.
- Packet is copied (via DMA) to a ring buffer in kernel memory.
- Hardware interrupt is generated to let the system know a packet is in memory.
- Driver calls into NAPI to start a poll loop if one was not running already.
ksoftirqd
processes run on each CPU on the system. They are registered at boot time. Theksoftirqd
processes pull packets off the ring buffer by calling the NAPIpoll
function that the device driver registered during initialization.
- Memory regions in the ring buffer that have had network data written to them are unmapped.
- Data that was DMA’d into memory is passed up the networking layer as an ‘skb’ for more processing.
- Incoming network data frames are distributed among multiple CPUs if packet steering is enabled or if the NIC has multiple receive queues.
- Network data frames are handed to the protocol layers from the queues.
- Protocol layers process data.
- Data is added to receive buffers attached to sockets by protocol layers.
数据包从到达至进入套接字接收缓冲区的大致路径如下:
- 驱动程序被加载并初始化。
- 数据包从网络到达网络接口卡(NIC)。
- 数据包通过直接内存访问(DMA)被复制到内核内存中的环形缓冲区。
- 生成硬件中断,通知系统内存中有数据包。
- 如果轮询循环尚未运行,驱动程序会调用 NAPI 启动轮询循环。
ksoftirqd
进程在系统中的每个 CPU 上运行,它们在系统启动时注册。ksoftirqd
进程通过调用设备驱动程序在初始化期间注册的 NAPIpoll
函数,从环形缓冲区中取出数据包。
- 环形缓冲区中写入了网络数据的内存区域被取消映射。
- 被 DMA 写入内存的数据将作为 “skb ”传递到网络层进行进一步处理。
- 如果启用了数据包导向功能,或者 NIC 有多个接收队列,则传入的网络数据帧会在多个 CPU 之间分配。
- 网络数据帧从队列传递到协议层。
- 协议层处理数据。
- 数据由协议层添加到与套接字关联的接收缓冲区中。
This entire flow will be examined in detail in the following sections.
The protocol layers examined below are the IP and UDP protocol layers. Much of the information presented will serve as a reference for other protocol layers, as well.
以下部分将详细研究整个流程。下面所研究的协议层是 IP 和 UDP 协议层,本文中呈现的许多信息也可作为其他协议层的参考。
Detailed Look(详细分析)
This blog post will be examining the Linux kernel version 3.13.0 with links to code on GitHub and code snippets throughout this post.
本文将研究 Linux 内核版本 3.13.0,并在文中提供 GitHub 代码链接和代码片段。
Understanding exactly how packets are received in the Linux kernel is very involved. We’ll need to closely examine and understand how a network driver works, so that parts of the network stack later are more clear.
要确切理解 Linux 内核中数据包的接收方式,涉及的内容非常多。我们需要仔细研究并理解网络驱动程序的工作原理,这样后续网络栈的部分内容会更加清晰。
This blog post will look at the igb network driver. This driver is used for a relatively common server NIC, the Intel Ethernet Controller I350. So, let’s start by understanding how the igb network driver works.
本文将研究 igb 网络驱动程序,该驱动程序用于相对常见的服务器 NIC——英特尔以太网控制器 I350。那么,让我们从了解 igb 网络驱动程序的工作原理开始。
Network Device Driver(网络设备驱动程序)
Initialization(初始化)
A driver registers an initialization function which is called by the kernel when the driver is loaded. This function is registered by using the module_init macro.
驱动程序会注册一个初始化函数,在驱动程序加载时由内核调用。这个函数通过使用 module_init 宏进行注册。
The igb initialization function (igb_init_module) and its registration with module_init can be found in drivers/net/ethernet/intel/igb/igb_main.c.
igb 初始化函数(igb_init_module)及其通过 module_init 的注册,可以在 drivers/net/ethernet/intel/igb/igb_main.c 中找到。
Both are fairly straightforward:
两者都相当简单:
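The snippet below is an abbreviated sketch of that registration as it appears in drivers/net/ethernet/intel/igb/igb_main.c (trimmed for brevity and lightly paraphrased; consult the kernel source for the exact code):

static int __init igb_init_module(void)
{
	int ret;

	pr_info("%s - version %s\n", igb_driver_string, igb_driver_version);
	pr_info("%s\n", igb_copyright);

	/* Register the driver with the PCI subsystem; this is where most of
	 * the device initialization work actually begins. */
	ret = pci_register_driver(&igb_driver);
	return ret;
}

module_init(igb_init_module);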
The bulk of the work to initialize the device happens with the call to pci_register_driver as we’ll see next.
正如我们接下来将看到的,初始化设备的大部分工作是通过调用 pci_register_driver 完成的。
PCI initialization(PCI 初始化)
The Intel I350 network card is a PCI express device.
英特尔 I350 网卡是一种 PCI Express 设备。
PCI devices identify themselves with a series of registers in the PCI Configuration Space.
PCI 设备通过 PCI 配置空间中的一系列寄存器来标识自己。
When a device driver is compiled, a macro named MODULE_DEVICE_TABLE (from include/module.h) is used to export a table of PCI device IDs identifying devices that the device driver can control. The table is also registered as part of a structure, as we’ll see shortly.
在编译设备驱动程序时,会使用一个名为 MODULE_DEVICE_TABLE(来自 include/module.h)的宏,导出一个 PCI 设备 ID 表,用于识别设备驱动程序可以控制的设备。该表也作为一个结构的一部分进行注册,我们很快就会看到。
The kernel uses this table to determine which device driver to load to control the device.
内核使用这个表来确定加载哪个设备驱动程序来控制设备。
That’s how the OS can figure out which devices are connected to the system and which driver should be used to talk to the device.
这就是操作系统如何确定哪些设备连接到系统,以及应该使用哪个驱动程序与设备进行通信的方式。
This table and the PCI device IDs for the igb driver can be found in drivers/net/ethernet/intel/igb/igb_main.c and drivers/net/ethernet/intel/igb/e1000_hw.h, respectively:
igb 驱动程序的这个表和 PCI 设备 ID 分别可以在 drivers/net/ethernet/intel/igb/igb_main.c 和 drivers/net/ethernet/intel/igb/e1000_hw.h 中找到:
static DEFINE_PCI_DEVICE_TABLE(igb_pci_tbl) = {
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_1GBPS) },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_SGMII) },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I354_BACKPLANE_2_5GBPS) },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I211_COPPER), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_FIBER), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SGMII), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_COPPER_FLASHLESS), board_82575 },
	{ PCI_VDEVICE(INTEL, E1000_DEV_ID_I210_SERDES_FLASHLESS), board_82575 },
	/* ... */
};

MODULE_DEVICE_TABLE(pci, igb_pci_tbl);
As seen in the previous section,
pci_register_driver
is called in the driver’s initialization function.如前所述,在驱动程序的初始化函数中会调用
pci_register_driver
。This function registers a structure of pointers. Most of the pointers are function pointers, but the PCI device ID table is also registered. The kernel uses the functions registered by the driver to bring the PCI device up.
该函数注册一个指针结构。大部分指针是函数指针,但 PCI 设备 ID 表也被注册。内核使用驱动程序注册的函数来启动 PCI 设备。
static struct pci_driver igb_driver = { .name = igb_driver_name, .id_table = igb_pci_tbl, .probe = igb_probe, .remove = igb_remove, /* ... */ };
PCI probe(PCI 探测)
Once a device has been identified by its PCI IDs, the kernel can then select the proper driver to use to control the device. Each PCI driver registers a probe function with the PCI system in the kernel. The kernel calls this function for devices which have not yet been claimed by a device driver. Once a device is claimed, other drivers will not be asked about the device. Most drivers have a lot of code that runs to get the device ready for use. The exact things done vary from driver to driver.
通过 PCI ID 识别设备后,内核就可以选择适当的驱动程序来控制设备。每个 PCI 驱动程序都会在内核中的 PCI 系统注册一个探测函数。
对于尚未被设备驱动程序认领的设备,内核会调用该函数。一旦设备被认领,其他驱动程序就不会再询问有关该设备的信息。
大多数驱动程序都有大量代码来使设备准备好投入使用,具体操作因驱动程序而异。Some typical operations to perform include:
- Enabling the PCI device.
- Requesting memory ranges and IO ports.
- Setting the DMA mask.
- The ethtool (described more below) functions the driver supports are registered.
- Any watchdog tasks needed (for example, e1000e has a watchdog task to check if the hardware is hung).
- Other device specific stuff like workarounds or dealing with hardware specific quirks or similar.
- The creation, initialization, and registration of a
struct net_device_ops
structure. This structure contains function pointers to the various functions needed for opening the device, sending data to the network, setting the MAC address, and more.
- The creation, initialization, and registration of a high level
struct net_device
which represents a network device.
一些典型的操作包括:
- 启用 PCI 设备。
- 请求内存范围和 I/O 端口。
- 设置 DMA 掩码。
- 注册驱动程序支持的 ethtool(下面会详细介绍)函数。
- 任何需要的看门狗任务(例如,e1000e 有一个看门狗任务,用于检查硬件是否挂起)。
- 其他特定于设备的操作,如解决硬件特定的问题或处理类似的硬件特性。
- 创建、初始化和注册一个
struct net_device_ops
结构,该结构包含指向打开设备、向网络发送数据、设置 MAC 地址等各种所需函数的指针。
- 创建、初始化和注册一个高级的
struct net_device
,它代表一个网络设备。
Let’s take a quick look at some of these operations in the igb driver in the function igb_probe.
让我们快速查看一下 igb 驱动程序中 igb_probe 函数中的一些操作。
A peek into PCI initialization(PCI 初始化窥探)
The following code from the igb_probe function does some basic PCI configuration. From drivers/net/ethernet/intel/igb/igb_main.c:
以下来自 igb_probe 函数的代码进行了一些基本的 PCI 配置。在 drivers/net/ethernet/intel/igb/igb_main.c 中:
err = pci_enable_device_mem(pdev);
/* ... */
err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
/* ... */
err = pci_request_selected_regions(pdev, pci_select_bars(pdev, IORESOURCE_MEM), igb_driver_name);
pci_enable_pcie_error_reporting(pdev);
pci_set_master(pdev);
pci_save_state(pdev);
First, the device is initialized with
pci_enable_device_mem
. This will wake up the device if it is suspended, enable memory resources, and more.首先,使用
pci_enable_device_mem
初始化设备。这将唤醒处于挂起状态的设备,启用内存资源等。Next, the DMA mask will be set. This device can read and write to 64bit memory addresses, so
dma_set_mask_and_coherent
is called with DMA_BIT_MASK(64)
.接下来,设置 DMA 掩码。 由于此设备可以读写 64 位内存地址,因此调用
dma_set_mask_and_coherent
并传入DMA_BIT_MASK(64)
Memory regions will be reserved with a call to pci_request_selected_regions, PCI Express Advanced Error Reporting is enabled (if the PCI AER driver is loaded), DMA is enabled with a call to pci_set_master, and the PCI configuration space is saved with a call to pci_save_state.
通过调用 pci_request_selected_regions 预留内存区域,启用 PCI Express 高级错误报告(如果加载了 PCI AER 驱动程序),通过调用 pci_set_master 启用 DMA,并通过调用 pci_save_state 保存 PCI 配置空间。
Phew.
More Linux PCI driver information(更多 Linux PCI 驱动程序信息)
Going into the full explanation of how PCI devices work is beyond the scope of this post, but this excellent talk, this wiki, and this text file from the linux kernel are excellent resources.
Network device initialization(网络设备初始化)
The
igb_probe
function does some important network device initialization. In addition to the PCI specific work, it will do more general networking and network device work:- The
struct net_device_ops
is registered.
ethtool
operations are registered.
- The default MAC address is obtained from the NIC.
net_device
feature flags are set.
- And lots more.
igb_probe
函数进行了一些重要的网络设备初始化工作。除了特定于 PCI 的工作外,它还会进行更通用的网络和网络设备相关工作:- 注册
struct net_device_ops
。
- 注册
ethtool
操作。
- 从 NIC 获取默认 MAC 地址。
- 设置
net_device
功能标志。
- 还有很多其他工作。
Let’s take a look at each of these as they will be interesting later.
让我们逐个查看这些内容,因为它们在后面会很重要。
struct net_device_ops
The struct net_device_ops contains function pointers to lots of important operations that the network subsystem needs to control the device. We’ll be mentioning this structure many times throughout the rest of this post.
struct net_device_ops 包含指向网络子系统控制设备所需的许多重要操作的函数指针。在本文的其余部分,我们会多次提到这个结构。
This net_device_ops structure is attached to a struct net_device in igb_probe. From drivers/net/ethernet/intel/igb/igb_main.c:
在 igb_probe 中,这个 net_device_ops 结构被附加到 struct net_device 上。
static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
	/* ... */
	netdev->netdev_ops = &igb_netdev_ops;
And the functions that this net_device_ops structure holds pointers to are set in the same file. From drivers/net/ethernet/intel/igb/igb_main.c:
并且这个 net_device_ops 结构所指向的函数在同一文件中设置。在 drivers/net/ethernet/intel/igb/igb_main.c 中:
static const struct net_device_ops igb_netdev_ops = {
	.ndo_open		= igb_open,
	.ndo_stop		= igb_close,
	.ndo_start_xmit		= igb_xmit_frame,
	.ndo_get_stats64	= igb_get_stats64,
	.ndo_set_rx_mode	= igb_set_rx_mode,
	.ndo_set_mac_address	= igb_set_mac,
	.ndo_change_mtu		= igb_change_mtu,
	.ndo_do_ioctl		= igb_ioctl,
	/* ... */
As you can see, there are several interesting fields in this
struct
like ndo_open
, ndo_stop
, ndo_start_xmit
, and ndo_get_stats64
which hold the addresses of functions implemented by the igb
driver.如您所见,这个
struct
中有几个有趣的字段,如ndo_open
、ndo_stop
、ndo_start_xmit
和ndo_get_stats64
,它们保存了由igb
驱动程序实现的函数地址。We’ll be looking at some of these in more detail later.
我们稍后会更详细地查看其中一些内容。
ethtool
registration(ethtool 注册)
ethtool
is a command line program you can use to get and set various driver and hardware options. You can install it on Ubuntu by running apt-get install ethtool
.ethtool
是一个命令行程序,可用于获取和设置各种驱动程序和硬件选项。您可以在 Ubuntu 上通过运行sudo apt-get install ethtool
来安装它。A common use of
ethtool
is to gather detailed statistics from network devices. Other ethtool
settings of interest will be described later.ethtool
的一个常见用途是从网络设备收集详细的统计信息,后面将介绍其他值得关注的ethtool
设置。The
ethtool
program talks to device drivers by using the ioctl
system call. The device drivers register a series of functions that run for the ethtool
operations and the kernel provides the glue.ethtool
程序通过使用ioctl
系统调用与设备驱动程序进行通信。设备驱动程序会注册一系列用于ethtool
操作的函数,内核则提供连接机制。When an
ioctl
call is made from ethtool
, the kernel finds the ethtool
structure registered by the appropriate driver and executes the functions registered. The driver’s ethtool
function implementation can do anything from change a simple software flag in the driver to adjusting how the actual NIC hardware works by writing register values to the device.当从
ethtool
发出ioctl
调用时,内核会找到由相应驱动程序注册的ethtool
结构,并执行注册的函数。驱动程序的ethtool
函数实现可以执行各种操作,从更改驱动程序中的简单软件标志,到通过向设备写入寄存器值来调整实际 NIC 硬件的工作方式。The
igb
driver registers its ethtool
operations in igb_probe
by calling igb_set_ethtool_ops
:igb
驱动程序在igb_probe
中通过调用igb_set_ethtool_ops
来注册其ethtool
操作:static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent) { /* ... */ igb_set_ethtool_ops(netdev);
All of the
igb
driver’s ethtool
code can be found in the file drivers/net/ethernet/intel/igb/igb_ethtool.c
along with the igb_set_ethtool_ops
function.igb
驱动程序的所有ethtool
代码以及igb_set_ethtool_ops
函数都可以在drivers/net/ethernet/intel/igb/igb_ethtool.c
文件中找到。void igb_set_ethtool_ops(struct net_device *netdev) { SET_ETHTOOL_OPS(netdev, &igb_ethtool_ops); }
Above that, you can find the
igb_ethtool_ops
structure with the ethtool
functions the igb
driver supports set to the appropriate fields.在上面的代码中,您可以找到
igb_ethtool_ops
结构,其中igb
驱动程序支持的ethtool
函数被设置到相应的字段中。static const struct ethtool_ops igb_ethtool_ops = { .get_settings = igb_get_settings, .set_settings = igb_set_settings, .get_drvinfo = igb_get_drvinfo, .get_regs_len = igb_get_regs_len, .get_regs = igb_get_regs, /* ... */
It is up to the individual drivers to determine which
ethtool
functions are relevant and which should be implemented. Not all drivers implement all ethtool
functions, unfortunately.由各个驱动程序自行决定哪些
ethtool
函数是相关的,以及应该实现哪些函数。遗憾的是,并非所有驱动程序都实现了所有的ethtool
函数。One interesting
ethtool
function is get_ethtool_stats
, which (if implemented) produces detailed statistics counters that are tracked either in software in the driver or via the device itself.一个有趣的
ethtool
函数是get_ethtool_stats
,如果实现了这个函数,它会生成详细的统计计数器,这些计数器可以在驱动程序的软件中跟踪,也可以通过设备本身跟踪。The monitoring section below will show how to use
ethtool
to access these detailed statistics.下面的监控部分将展示如何使用
ethtool
来访问这些详细的统计信息。IRQs(中断)
When a data frame is written to RAM via DMA, how does the NIC tell the rest of the system that data is ready to be processed?
当数据帧通过 DMA 写入内存时,NIC 如何告知系统的其他部分数据已准备好进行处理呢?
Traditionally, a NIC would generate an interrupt request (IRQ) indicating data had arrived. There are three common types of IRQs: MSI-X, MSI, and legacy IRQs. These will be touched upon shortly. A device generating an IRQ when data has been written to RAM via DMA is simple enough, but if large numbers of data frames arrive this can lead to a large number of IRQs being generated. The more IRQs that are generated, the less CPU time is available for higher level tasks like user processes.
传统上,NIC 会生成一个中断请求(IRQ),表明数据已到达。常见的 IRQ 有三种类型:MSI-X、MSI 和传统 IRQ,稍后会简要介绍。设备在数据通过 DMA 写入内存时生成 IRQ,这本身很简单,但如果大量数据帧到达,可能会导致生成大量的 IRQ。生成的 IRQ 越多,用于更高层次任务(如用户进程)的 CPU 时间就越少。
The New Api (NAPI) was created as a mechanism for reducing the number of IRQs generated by network devices on packet arrival. While NAPI reduces the number of IRQs, it cannot eliminate them completely. We’ll see why that is, exactly, in later sections.
新的 API(NAPI)作为一种减少网络设备在数据包到达时生成 IRQ 数量的机制应运而生。虽然 NAPI 减少了 IRQ 的数量,但它无法完全消除它们,我们将在后面的章节中详细了解原因。
NAPI
NAPI differs from the legacy method of harvesting data in several important ways. NAPI allows a device driver to register a
poll
function that the NAPI subsystem will call to harvest data frames.NAPI 在几个重要方面与传统的数据收集方法不同。NAPI 允许设备驱动程序注册一个
poll
函数,NAPI 子系统会调用这个函数来收集数据帧。The intended use of NAPI in network device drivers is as follows:
- NAPI is enabled by the driver, but is in the off position initially.
- A packet arrives and is DMA’d to memory by the NIC.
- An IRQ is generated by the NIC which triggers the IRQ handler in the driver.
- The driver wakes up the NAPI subsystem using a softirq (more on these later). This will begin harvesting packets by calling the driver’s registered
poll
function in a separate thread of execution.
- The driver should disable further IRQs from the NIC. This is done to allow the NAPI subsystem to process packets without interruption from the device.
- Once there is no more work to do, the NAPI subsystem is disabled and IRQs from the device are re-enabled.
- The process starts back at step 2.
网络设备驱动程序中使用 NAPI 的预期方式如下:
- 驱动程序启用 NAPI,但最初处于关闭状态。
- 一个数据包到达,并由 NIC 通过 DMA 传输到内存。
- NIC 生成一个 IRQ,触发驱动程序中的 IRQ 处理程序。
- 驱动程序使用软中断(稍后会详细介绍)唤醒 NAPI 子系统。这将通过在一个单独的执行线程中调用驱动程序注册的
poll
函数开始收集数据包。
- 驱动程序应禁用 NIC 的进一步 IRQ,这样做是为了让 NAPI 子系统在不受设备干扰的情况下处理数据包。
- 一旦没有更多工作要做,NAPI 子系统被禁用,设备的 IRQ 重新启用。
- 该过程从步骤 2 重新开始。
This method of gathering data frames has reduced overhead compared to the legacy method because many data frames can be consumed at a time without having to deal with processing each of them one IRQ at a time.
这种收集数据帧的方法与传统方法相比,减少了开销,因为可以一次处理多个数据帧,而无需为每个数据帧单独处理一个 IRQ。
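To make the flow above concrete, here is a minimal, hypothetical sketch of what a NAPI poll function generally looks like. The example_* names and types are made up for illustration; this is not the actual igb_poll implementation:

/* Hypothetical NAPI poll skeleton; example_* helpers are illustrative only. */
static int example_poll(struct napi_struct *napi, int budget)
{
	struct example_q_vector *q_vector =
		container_of(napi, struct example_q_vector, napi);
	int work_done;

	/* Harvest up to 'budget' frames from the RX ring and pass them
	 * up the networking stack. */
	work_done = example_clean_rx_ring(q_vector, budget);

	if (work_done < budget) {
		/* The ring is drained: exit polled mode and re-enable
		 * hardware interrupts from the device. */
		napi_complete(napi);
		example_enable_irqs(q_vector);
	}

	/* Returning a value equal to 'budget' tells NAPI to keep polling. */
	return work_done;
}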
The device driver implements a
poll
function and registers it with NAPI by calling netif_napi_add
. When registering a NAPI poll
function with netif_napi_add
, the driver will also specify the weight
. Most of the drivers hardcode a value of 64
. This value and its meaning will be described in more detail below.设备驱动程序实现一个
poll
函数,并通过调用netif_napi_add
向 NAPI 注册它。在向netif_napi_add
注册 NAPI poll
函数时,驱动程序还会指定weight
,大多数驱动程序将其硬编码为64
。下面将更详细地描述这个值及其含义。
Typically, drivers register their NAPI poll functions during driver initialization.
通常,驱动程序在驱动程序初始化期间注册它们的 NAPI poll 函数。
NAPI initialization in the igb driver(igb 驱动程序中的 NAPI 初始化)
The
igb
driver does this via a long call chain:igb_probe
callsigb_sw_init
.
igb_sw_init
callsigb_init_interrupt_scheme
.
igb_init_interrupt_scheme
callsigb_alloc_q_vectors
.
igb_alloc_q_vectors
callsigb_alloc_q_vector
.
igb_alloc_q_vector
callsnetif_napi_add
.
igb
驱动程序通过一个长调用链来完成此操作:igb_probe
调用igb_sw_init
。
igb_sw_init
调用igb_init_interrupt_scheme
。
igb_init_interrupt_scheme
调用igb_alloc_q_vectors
。
igb_alloc_q_vectors
调用igb_alloc_q_vector
。
igb_alloc_q_vector
调用netif_napi_add
。
This call trace results in a few high level things happening:
- If MSI-X is supported, it will be enabled with a call to
pci_enable_msix
.
- Various settings are computed and initialized; most notably the number of transmit and receive queues that the device and driver will use for sending and receiving packets.
igb_alloc_q_vector
is called once for every transmit and receive queue that will be created.
- Each call to
igb_alloc_q_vector
callsnetif_napi_add
to register apoll
function for that queue and an instance ofstruct napi_struct
that will be passed topoll
when called to harvest packets.
这个调用跟踪导致了一些高层次的事情发生:
- 如果支持 MSI-X,将通过调用
pci_enable_msix
启用它。
- 计算并初始化各种设置,最值得注意的是设备和驱动程序将用于发送和接收数据包的传输和接收队列的数量。
- 为每个将创建的传输和接收队列调用一次
igb_alloc_q_vector
。
- 每次对
igb_alloc_q_vector
的调用都会调用netif_napi_add
,为该队列注册一个poll
函数,以及一个struct napi_struct
实例,当调用该函数收集数据包时,这个实例将被传递给poll
函数。
Let’s take a look at
igb_alloc_q_vector
to see how the poll
callback and its private data are registered.让我们看一下
igb_alloc_q_vector
,了解poll
回调及其私有数据是如何注册的。在drivers/net/ethernet/intel/igb/igb_main.c
中:static int igb_alloc_q_vector(struct igb_adapter *adapter, int v_count, int v_idx, int txr_count, int txr_idx, int rxr_count, int rxr_idx) { /* ... */ /* allocate q_vector and rings */ q_vector = kzalloc(size, GFP_KERNEL); if (!q_vector) return -ENOMEM; /* initialize NAPI */ netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64); /* ... */
The above code is allocating memory for a receive queue and registering the function igb_poll with the NAPI subsystem. It provides a reference to the struct napi_struct associated with this newly created RX queue (&q_vector->napi above). This will be passed into igb_poll when called by the NAPI subsystem when it comes time to harvest packets from this RX queue.
上面的代码为接收队列分配内存,并向 NAPI 子系统注册
igb_poll
函数。它提供了一个指向与这个新创建的 RX 队列相关联的struct napi_struct
(上面的&q_vector->napi
)的引用。当 NAPI 子系统需要从这个 RX 队列收集数据包时,这个引用将被传递给igb_poll
函数。This will be important later when we examine the flow of data from drivers up the network stack.
当我们稍后检查从驱动程序到网络栈的数据流动时,这一点将很重要。
Bringing a network device up(启用网络设备)
Recall the
net_device_ops
structure we saw earlier which registered a set of functions for bringing the network device up, transmitting packets, setting the MAC address, etc.回想一下我们之前看到的
net_device_ops
结构,它注册了一组用于启用网络设备、传输数据包、设置 MAC 地址等的函数。When a network device is brought up (for example, with
ifconfig eth0 up
), the function attached to the ndo_open
field of the net_device_ops
structure is called.当启用网络设备时(例如,使用
ifconfig eth0 up
命令),会调用net_device_ops
结构中ndo_open
字段所关联的函数。The
ndo_open
function will typically do things like:- Allocate RX and TX queue memory
- Enable NAPI
- Register an interrupt handler
- Enable hardware interrupts
- And more.
ndo_open
函数通常会执行以下操作:- 分配 RX 和 TX 队列内存。
- 启用 NAPI。
- 注册一个中断处理程序。
- 启用硬件中断。
- 还有更多操作。
In the case of the igb driver, the function attached to the ndo_open field of the net_device_ops structure is called igb_open.
在 igb 驱动程序的情况下,net_device_ops 结构中 ndo_open 字段所关联的函数名为 igb_open。
Preparing to receive data from the network(准备从网络接收数据)
Most NICs you’ll find today will use DMA to write data directly into RAM where the OS can retrieve the data for processing. The data structure most NICs use for this purpose resembles a queue built on circular buffer (or a ring buffer).
目前,大多数网卡都使用 DMA 将数据直接写入 RAM,操作系统可以在 RAM 中读取数据进行处理。大多数 NIC 为此使用的数据结构类似于建立在圆形缓冲区(或环形缓冲区)上的队列。
In order to do this, the device driver must work with the OS to reserve a region of memory that the NIC hardware can use. Once this region is reserved, the hardware is informed of its location and incoming data will be written to RAM where it will later be picked up and processed by the networking subsystem.
为了实现这一点,设备驱动程序必须与操作系统合作,保留一块 NIC 硬件可以使用的内存区域。一旦保留了这个区域,硬件就会得知其位置,传入的数据将被写入内存,稍后将由网络子系统提取并处理。
This seems simple enough, but what if the packet rate was high enough that a single CPU was not able to properly process all incoming packets? The data structure is built on a fixed length region of memory, so incoming packets would be dropped.
这看起来很简单,但如果数据包速率足够高,以至于单个 CPU 无法正确处理所有传入的数据包会怎样呢?由于数据结构是基于固定长度的内存区域构建的,传入的数据包可能会被丢弃。
This is where something known as Receive Side Scaling (RSS) or multiqueue can help.
这就是所谓的接收端缩放(RSS)或多队列技术可以发挥作用的地方。
Some devices have the ability to write incoming packets to several different regions of RAM simultaneously; each region is a separate queue. This allows the OS to use multiple CPUs to process incoming data in parallel, starting at the hardware level. This feature is not supported by all NICs.
有些设备可以将接收到的数据包同时写入 RAM 的多个不同区域;每个区域都是一个单独的队列。这样,操作系统就可以从硬件层面开始,使用多个 CPU 并行处理传入数据。并非所有 NIC 都支持此功能。
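One quick way to see whether your NIC exposes multiple receive queues is to list the per-queue directories under sysfs. The interface name eth0 and the output below are just illustrative (this example assumes a device with 4 combined queues):

$ ls /sys/class/net/eth0/queues/
rx-0  rx-1  rx-2  rx-3  tx-0  tx-1  tx-2  tx-3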
The Intel I350 NIC does support multiple queues. We can see evidence of this in the
igb
driver. One of the first things the igb
driver does when it is brought up is call a function named igb_setup_all_rx_resources
. This function calls another function, igb_setup_rx_resources
, once for each RX queue to arrange for DMA-able memory where the device will write incoming data.英特尔 I350 NIC 支持多个队列。我们可以在
igb
驱动程序中看到这一点的证据。igb
驱动程序启动时首先要做的事情之一就是调用一个名为igb_setup_all_rx_resources
的函数。这个函数会为每个 RX 队列调用另一个函数igb_setup_rx_resources
一次,为设备写入传入数据安排可进行 DMA 操作的内存。If you are curious how exactly this works, please see the Linux kernel’s DMA API HOWTO.
如果您对具体的工作方式感到好奇,请查看 Linux 内核的DMA API 指南。
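As a rough, hypothetical sketch of what that arrangement looks like (loosely modeled on igb_setup_rx_resources; field names are approximate and error handling is trimmed), the driver allocates a DMA-coherent region for the descriptor ring and hands its bus address to the device:

/* Approximate sketch of RX descriptor ring allocation; see
 * igb_setup_rx_resources() in igb_main.c for the real code. */
rx_ring->size = rx_ring->count * sizeof(union e1000_adv_rx_desc);
rx_ring->size = ALIGN(rx_ring->size, 4096);

/* dma_alloc_coherent() returns a kernel virtual address and fills in
 * rx_ring->dma with the bus address the NIC will use for DMA. */
rx_ring->desc = dma_alloc_coherent(rx_ring->dev, rx_ring->size,
				   &rx_ring->dma, GFP_KERNEL);
if (!rx_ring->desc)
	return -ENOMEM;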
It turns out the number and size of the RX queues can be tuned by using
ethtool
. Tuning these values can have a noticeable impact on the number of frames which are processed vs the number of frames which are dropped.事实证明,RX 队列的数量和大小可以使用
ethtool
进行调整。调整这些值对处理的帧数与丢弃的帧数会有显著影响。The NIC uses a hash function on the packet header fields (like source, destination, port, etc) to determine which RX queue the data should be directed to.
NIC 使用数据包头部字段(如源地址、目的地址、端口等)上的哈希函数来确定数据应该被定向到哪个 RX 队列。
Some NICs let you adjust the weight of the RX queues, so you can send more traffic to specific queues.
一些 NIC 允许您调整 RX 队列的权重,这样您就可以将更多流量发送到特定的队列。
Fewer NICs let you adjust this hash function itself. If you can adjust the hash function, you can send certain flows to specific RX queues for processing or even drop the packets at the hardware level, if desired.
更少的 NIC 允许您调整这个哈希函数本身。如果您可以调整哈希函数,您可以将特定的流量发送到特定的 RX 队列进行处理,甚至可以根据需要在硬件级别丢弃数据包。
We’ll take a look at how to tune these settings shortly.
我们稍后将了解如何调整这些设置。
Enable NAPI(启用 NAPI)
When a network device is brought up, a driver will usually enable NAPI.
当网络设备启动时,驱动程序通常会启用 NAPI。
We saw earlier how drivers register
poll
functions with NAPI, but NAPI is not usually enabled until the device is brought up.我们之前看到了驱动程序如何向 NAPI 注册
poll
函数,但 NAPI 通常在设备启动之前不会被启用。Enabling NAPI is relatively straight forward. A call to
napi_enable
will flip a bit in the struct napi_struct
to indicate that it is now enabled. As mentioned above, while NAPI will be enabled it will be in the off position.启用 NAPI 相对简单,调用
napi_enable
会在struct napi_struct
中翻转一个位,以表明它现在已启用。如前所述,虽然 NAPI 会被启用,但它将处于关闭位置。In the case of the
igb
driver, NAPI is enabled for each q_vector
that was initialized when the driver was loaded or when the queue count or size are changed with ethtool
.对于
igb
驱动程序,NAPI 会在加载驱动程序时或使用 ethtool
更改队列数或队列大小时为每个初始化的 q_vector
启用。for (i = 0; i < adapter->num_q_vectors; i++) napi_enable(&(adapter->q_vector[i]->napi));
Register an interrupt handler(注册中断处理程序)
After enabling NAPI, the next step is to register an interrupt handler. There are different methods a device can use to signal an interrupt: MSI-X, MSI, and legacy interrupts. As such, the code differs from device to device depending on what the supported interrupt methods are for a particular piece of hardware.
启用 NAPI 后,下一步是注册一个中断处理程序。设备可以使用不同的方法来发出中断信号:MSI-X、MSI 和传统中断。因此,根据特定硬件支持的中断方法,不同设备的代码也有所不同。
The driver must determine which method is supported by the device and register the appropriate handler function that will execute when the interrupt is received.
驱动程序必须确定设备支持哪种方法,并注册在接收到中断时将执行的适当处理程序函数。
Some drivers, like the
igb
driver, will try to register an interrupt handler with each method, falling back to the next untested method on failure.一些驱动程序,如
igb
驱动程序,会尝试使用每种方法注册一个中断处理程序,如果失败则回退到下一个未测试的方法。MSI-X interrupts are the preferred method, especially for NICs that support multiple RX queues. This is because each RX queue can have its own hardware interrupt assigned, which can then be handled by a specific CPU (with
irqbalance
or by modifying /proc/irq/IRQ_NUMBER/smp_affinity
). As we’ll see shortly, the CPU that handles the interrupt will be the CPU that processes the packet. In this way, arriving packets can be processed by separate CPUs from the hardware interrupt level up through the networking stack.MSI-X 中断是首选方法,特别是对于支持多个 RX 队列的 NIC。这是因为每个 RX 队列可以有自己的硬件中断分配,然后可以由特定的 CPU 处理(通过
irqbalance
或通过修改/proc/irq/IRQ_NUMBER/smp_affinity
)。正如我们稍后将看到的,处理中断的 CPU 将是处理数据包的 CPU。通过这种方式,从硬件中断级别到网络栈,到达的数据包可以由不同的 CPU 进行处理。If MSI-X is unavailable, MSI still presents advantages over legacy interrupts and will be used by the driver if the device supports it. Read this useful wiki page for more information about MSI and MSI-X.
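For example, assuming the IRQ number for a given RX queue turned out to be 42 (a made-up number; find the real ones in /proc/interrupts), you could pin its handling to CPU 2 like this:

$ grep eth0 /proc/interrupts                         # find the IRQ number of each RX queue
$ sudo sh -c 'echo 4 > /proc/irq/42/smp_affinity'    # bitmask 0x4 selects CPU 2
$ cat /proc/irq/42/smp_affinity
4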
如果 MSI-X 不可用,MSI 仍然比传统中断具有优势,如果设备支持,驱动程序将使用它。有关 MSI 和 MSI-X 的更多信息,请阅读这个有用的维基页面。
In the
igb
driver, the functions igb_msix_ring
, igb_intr_msi
, igb_intr
are the interrupt handler methods for the MSI-X, MSI, and legacy interrupt modes, respectively.在
igb
驱动程序中,igb_msix_ring
、igb_intr_msi
、igb_intr
函数分别是 MSI-X、MSI 和传统中断模式的中断处理程序方法。You can find the code in the driver which attempts each interrupt method in drivers/net/ethernet/intel/igb/igb_main.c:
您可以在
drivers/net/ethernet/intel/igb/igb_main.c
中的驱动程序代码中找到尝试每种中断方法的代码:static int igb_request_irq(struct igb_adapter *adapter) { struct net_device *netdev = adapter->netdev; struct pci_dev *pdev = adapter->pdev; int err = 0; if (adapter->msix_entries) { err = igb_request_msix(adapter); if (!err) goto request_done; /* fall back to MSI */ /* ... */ } /* ... */ if (adapter->flags & IGB_FLAG_HAS_MSI) { err = request_irq(pdev->irq, igb_intr_msi, 0, netdev->name, adapter); if (!err) goto request_done; /* fall back to legacy interrupts */ /* ... */ } err = request_irq(pdev->irq, igb_intr, IRQF_SHARED, netdev->name, adapter); if (err) dev_err(&pdev->dev, "Error %d getting interrupt\n", err); request_done: return err; }
As you can see in the abbreviated code above, the driver first attempts to set an MSI-X interrupt handler with igb_request_msix, falling back to MSI on failure. Next, request_irq is used to register igb_intr_msi, the MSI interrupt handler. If this fails, the driver falls back to legacy interrupts. request_irq is used again to register the legacy interrupt handler igb_intr.
正如您在上面的缩写代码中看到的,驱动程序首先尝试使用 igb_request_msix 设置 MSI-X 中断处理程序,如果失败则回退到 MSI。接下来,使用 request_irq 注册 igb_intr_msi,即 MSI 中断处理程序。如果这也失败,驱动程序回退到传统中断,再次使用 request_irq 注册传统中断处理程序 igb_intr。
And this is how the igb driver registers a function that will be executed when the NIC raises an interrupt signaling that data has arrived and is ready for processing.
这就是 igb 驱动程序注册一个函数的方式,当 NIC 发出中断信号表明数据已到达并准备好进行处理时,该函数将被执行。
Enable Interrupts(启用中断)
At this point, almost everything is setup. The only thing left is to enable interrupts from the NIC and wait for data to arrive. Enabling interrupts is hardware specific, but the
igb
driver does this in __igb_open
by calling a helper function named igb_irq_enable
.此时,几乎所有设置都已完成。剩下的唯一事情是启用 NIC 的中断并等待数据到达。启用中断是特定于硬件的,但
igb
驱动程序在__igb_open
中通过调用一个名为igb_irq_enable
的辅助函数来完成此操作。Interrupts are enabled for this device by writing to registers:
通过向寄存器写入来为这个设备启用中断:
static void igb_irq_enable(struct igb_adapter *adapter) { /* ... */ wr32(E1000_IMS, IMS_ENABLE_MASK | E1000_IMS_DRSTA); wr32(E1000_IAM, IMS_ENABLE_MASK | E1000_IMS_DRSTA); /* ... */ }
The network device is now up(网络设备现已启动)
Drivers may do a few more things like start timers, work queues, or other hardware-specific setup. Once that is completed, the network device is up and ready for use.
驱动程序可能会执行一些其他操作,如启动定时器、工作队列或其他特定于硬件的设置。一旦完成这些操作,网络设备就启动并准备好使用了。
Let’s take a look at monitoring and tuning settings for network device drivers.
让我们来看看网络设备驱动程序的监控和调整设置。
Monitoring network devices(监控网络设备)
There are several different ways to monitor your network devices offering different levels of granularity and complexity. Let’s start with most granular and move to least granular.
有几种不同的方法可以监控网络设备,它们提供不同程度的粒度和复杂度。让我们从最细粒度的方法开始,逐步转向最粗粒度的方法。
Using ethtool -S
(使用 ethtool -S)
You can install
ethtool
on an Ubuntu system by running: sudo apt-get install ethtool
.您可以在 Ubuntu 系统上通过运行
sudo apt-get install ethtool
来安装ethtool
。Once it is installed, you can access the statistics by passing the
-S
flag along with the name of the network device you want statistics about.安装完成后,您可以通过传递
-S
标志以及您想要获取统计信息的网络设备名称来访问这些统计信息。Monitor detailed NIC device statistics (e.g., packet drops) with
ethtool -S
使用
ethtool -S
监控详细的 NIC 设备统计信息(例如,数据包丢弃情况):$ sudo ethtool -S eth0 NIC statistics: rx_packets: 597028087 tx_packets: 5924278060 rx_bytes: 112643393747 tx_bytes: 990080156714 rx_broadcast: 96 tx_broadcast: 116 rx_multicast: 20294528 ....
Monitoring this data can be difficult. It is easy to obtain, but there is no standardization of the field values. Different drivers, or even different versions of the same driver might produce different field names that have the same meaning.
监控这些数据可能很困难。虽然数据很容易获取,但字段值没有标准化。不同的驱动程序,甚至同一驱动程序的不同版本,可能会产生具有相同含义但名称不同的字段。
You should look for values with “drop”, “buffer”, “miss”, etc in the label. Next, you will have to read your driver source. You’ll be able to determine which values are accounted for totally in software (e.g., incremented when there is no memory) and which values come directly from hardware via a register read. In the case of a register value, you should consult the data sheet for your hardware to determine what the meaning of the counter really is; many of the labels given via ethtool can be misleading.
您应该查找标签中带有 “drop”(丢弃)、“buffer”(缓冲区)、“miss”(错过)等字样的值。接下来,您必须阅读驱动程序源代码,这样您就能确定哪些值完全在软件中统计(例如,在没有内存时递增),哪些值是通过寄存器读取直接从硬件获取的。对于寄存器值,您应该查阅硬件的数据表,以确定计数器的真正含义;通过 ethtool 给出的许多标签可能会产生误导。
Using sysfs(使用 sysfs)
sysfs also provides a lot of statistics values, but they are slightly higher level than the direct NIC level stats provided.
sysfs 也提供了许多统计值,但它们比直接的 NIC 级统计信息稍微高级一些。
You can find the number of dropped incoming network data frames for, e.g. eth0 by using
cat
on a file.例如,您可以使用
cat
命令查看eth0
的传入网络数据帧的丢弃数量。Monitor higher level NIC statistics with sysfs.
使用 sysfs 监控更高级别的 NIC 统计信息:
$ cat /sys/class/net/eth0/statistics/rx_dropped 2
The counter values will be split into files like
collisions
, rx_dropped
, rx_errors
, rx_missed_errors
, etc.计数器值会被分割到像
collisions
(冲突)、rx_dropped
(接收丢弃)、rx_errors
(接收错误)、rx_missed_errors
(接收错过错误)等文件中。Unfortunately, it is up to the drivers to decide what the meaning of each field is, and thus, when to increment them and where the values come from. You may notice that some drivers count a certain type of error condition as a drop, but other drivers may count the same as a miss.
不幸的是,每个字段的含义由驱动程序决定,因此,何时递增这些字段以及这些值的来源也由驱动程序决定。您可能会注意到,一些驱动程序将某种类型的错误情况计为丢弃,而其他驱动程序可能将相同情况计为错过。
If these values are critical to you, you will need to read your driver source to understand exactly what your driver thinks each of these values means.
如果这些值对您很关键,您需要阅读驱动程序源代码,以准确理解您的驱动程序对每个这些值的理解。
Using /proc/net/dev
(使用 /proc/net/dev)
An even higher level file is
/proc/net/dev
which provides high-level summary-esque information for each network adapter on the system.更高级别的文件是
/proc/net/dev
,它为系统上的每个网络适配器提供高级别的汇总信息。Monitor high level NIC statistics by reading
/proc/net/dev
.通过读取
/proc/net/dev
监控高级别的 NIC 统计信息:$ cat /proc/net/dev Inter-| Receive | Transmit face | bytes packets errs drop fifo frame compressed multicast | bytes packets errs drop fifo colls carrier compressed eth0: 110346752214 597737500 0 2 0 0 0 20963860 990024805984 6066582604 0 0 0 0 0 0 lo: 428349463836 1579868535 0 0 0 0 0 0 428349463836 1579868535 0 0 0 0 0 0
This file shows a subset of the values you’ll find in the sysfs files mentioned above, but it may serve as a useful general reference.
这个文件显示了您在上面提到的 sysfs 文件中会找到的值的一个子集,但它可能是一个有用的通用参考。
The caveat mentioned above applies here, as well: if these values are important to you, you will still need to read your driver source to understand exactly when, where, and why they are incremented to ensure your understanding of an error, drop, or fifo are the same as your driver.
上面提到的注意事项在此处同样适用:如果这些值对您很重要,您仍然需要阅读驱动程序源代码,以准确了解它们何时、何地以及为何递增,以确保您对错误、丢包或 FIFO(先进先出队列)的理解与驱动程序一致。
Tuning network devices(调整网络设备)
Check the number of RX queues being used(检查正在使用的 RX 队列数量)
If your NIC and the device driver loaded on your system support RSS / multiqueue, you can usually adjust the number of RX queues (also called RX channels), by using
ethtool
.如果你的 NIC 和系统上加载的设备驱动程序支持 RSS(接收端缩放)/ 多队列功能,通常可以使用
ethtool
调整 RX 队列(也称为 RX 通道)的数量。Check the number of NIC receive queues with
ethtool
使用
ethtool
检查 NIC 接收队列的数量:$ sudo ethtool -l eth0 Channel parameters for eth0: Pre-set maximums: RX: 0 TX: 0 Other: 0 Combined: 8 Current hardware settings: RX: 0 TX: 0 Other: 0 Combined: 4
This output is displaying the pre-set maximums (enforced by the driver and the hardware) and the current settings.
此输出显示了预设的最大值(由驱动程序和硬件强制执行)和当前设置。
Note: not all device drivers will have support for this operation.
注意:并非所有设备驱动程序都支持此操作。
Error seen if your NIC doesn't support this operation.
如果你的 NIC 不支持此操作,会看到以下错误:
$ sudo ethtool -l eth0 Channel parameters for eth0: Cannot get device channel parameters : Operation not supported
This means that your driver has not implemented the ethtool
get_channels
operation. This could be because the NIC doesn’t support adjusting the number of queues, doesn’t support RSS / multiqueue, or your driver has not been updated to handle this feature.这意味着你的驱动程序未实现
ethtool
的get_channels
操作。这可能是因为 NIC 不支持调整队列数量、不支持 RSS / 多队列功能,或者你的驱动程序尚未更新以处理此功能。Adjusting the number of RX queues(调整 RX 队列数量)
Once you’ve found the current and maximum queue count, you can adjust the values by using
sudo ethtool -L
.找到当前和最大队列数量后,可以使用
sudo ethtool -L
调整这些值。Note: some devices and their drivers only support combined queues that are paired for transmit and receive, as in the example in the above section.
注意:一些设备及其驱动程序仅支持成对的发送和接收组合队列,如上面部分中的示例。
Set combined NIC transmit and receive queues to 8 with
ethtool -L
使用
ethtool -L
将 NIC 的组合发送和接收队列设置为 8:$ sudo ethtool -L eth0 combined 8
If your device and driver support individual settings for RX and TX and you’d like to change only the RX queue count to 8, you would run:
Set the number of NIC receive queues to 8 with
ethtool -L
.如果你的设备和驱动程序支持对 RX 和 TX 进行单独设置,并且你只想将 RX 队列数量更改为 8,可以运行:
$ sudo ethtool -L eth0 rx 8
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
注意:对于大多数驱动程序,进行这些更改会使网络接口先关闭再重新启动,与该接口的连接将被中断。不过,对于一次性更改而言,这可能影响不大。
Adjusting the size of the RX queues(调整 RX 队列大小)
Some NICs and their drivers also support adjusting the size of the RX queue. Exactly how this works is hardware specific, but luckily
ethtool
provides a generic way for users to adjust the size. Increasing the size of the RX queue can help prevent network data drops at the NIC during periods where large numbers of data frames are received. Data may still be dropped in software, though, and other tuning is required to reduce or eliminate drops completely.一些 NIC 及其驱动程序还支持调整 RX 队列的大小。具体的实现方式因硬件而异,但幸运的是,
ethtool
为用户提供了一种通用的调整大小的方法。增加 RX 队列的大小有助于防止在接收大量数据帧期间 NIC 丢弃网络数据。不过,数据仍可能在软件层面被丢弃,还需要进行其他调整以减少或完全消除丢包。Check current NIC queue sizes with
ethtool -g
使用
ethtool -g
检查当前 NIC 队列大小:$ sudo ethtool -g eth0 Ring parameters for eth0: Pre-set maximums: RX: 4096 RX Mini: 0 RX Jumbo: 0 TX: 4096 Current hardware settings: RX: 512 RX Mini: 0 RX Jumbo: 0 TX: 512
The above output indicates that the hardware supports up to 4096 receive and transmit descriptors, but it is currently only using 512.
上述输出表明,硬件最多支持 4096 个接收和传输描述符,但目前仅使用了 512 个。
Increase size of each RX queue to 4096 with
ethtool -G
使用
ethtool -G
将每个 RX 队列的大小增加到 4096:$ sudo ethtool -G eth0 rx 4096
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
注意:对于大多数驱动程序,进行这些更改会使网络接口先关闭再重新启动,与该接口的连接将被中断。不过,对于一次性更改而言,这可能影响不大。
Adjusting the processing weight of RX queues(调整 RX 队列的处理权重)
Some NICs support the ability to adjust the distribution of network data among the RX queues by setting a weight.
一些 NIC 支持通过设置权重来调整网络数据在 RX 队列之间的分配。
You can configure this if:
- Your NIC supports flow indirection.
- Your driver implements the
ethtool
functionsget_rxfh_indir_size
andget_rxfh_indir
.
- You are running a new enough version of
ethtool
that has support for the command line optionsx
andX
to show and set the indirection table, respectively.
如果满足以下条件,你可以配置此设置:
- 你的 NIC 支持流间接功能。
- 你的驱动程序实现了
ethtool
函数get_rxfh_indir_size
和get_rxfh_indir
。
- 你运行的
ethtool
版本足够新,支持命令行选项x
和X
,分别用于显示和设置间接表。
Check the RX flow indirection table with
ethtool -x
使用
ethtool -x
检查 RX 流间接表:$ sudo ethtool -x eth0 RX flow hash indirection table for eth3 with 2 RX ring(s): 0: 0 1 0 1 0 1 0 1 8: 0 1 0 1 0 1 0 1 16: 0 1 0 1 0 1 0 1 24: 0 1 0 1 0 1 0 1
This output shows packet hash values on the left, with receive queue 0 and 1 listed. So, a packet which hashes to 2 will be delivered to receive queue 0, while a packet which hashes to 3 will be delivered to receive queue 1.
此输出在左侧显示数据包哈希值,右侧列出接收队列 0 和 1。因此,哈希值为 2 的数据包将被发送到接收队列 0,而哈希值为 3 的数据包将被发送到接收队列 1。
Example: spread processing evenly between first 2 RX queues
示例:在最初的 2 个 RX 队列之间平均分配处理任务
$ sudo ethtool -X eth0 equal 2
If you want to set custom weights to alter the number of packets which hit certain receive queues (and thus CPUs), you can specify those on the command line, as well:
如果你想设置自定义权重以改变到达特定接收队列(进而到达特定 CPU)的数据包数量,也可以在命令行中指定这些权重:
Set custom RX queue weights with
ethtool -X
使用
ethtool -X
设置自定义 RX 队列权重:$ sudo ethtool -X eth0 weight 6 2
The above command specifies a weight of 6 for rx queue 0 and 2 for rx queue 1, pushing much more data to be processed on queue 0.
上述命令为 rx 队列 0 指定权重为 6,为 rx 队列 1 指定权重为 2,从而使更多数据在队列 0 上进行处理。
Some NICs will also let you adjust the fields which will be used in the hash algorithm, as we’ll see now.
一些 NIC 还允许你调整哈希算法中使用的字段,我们现在来了解一下。
Adjusting the rx hash fields for network flows(调整网络流的 rx 哈希字段)
You can use
ethtool
to adjust the fields that will be used when computing a hash for use with RSS.你可以使用
ethtool
调整用于计算 RSS 哈希值时使用的字段。Check which fields are used for UDP RX flow hash with
ethtool -n
.使用
ethtool -n
检查用于 UDP RX 流哈希的字段:$ sudo ethtool -n eth0 rx-flow-hash udp4 UDP over IPV4 flows use these fields for computing Hash flow key: IP SA IP DA
For eth0, the fields that are used for computing a hash on UDP flows is the IPv4 source and destination addresses. Let’s include the source and destination ports:
对于 eth0,用于计算 UDP 流哈希值的字段是 IPv4 源地址和目的地址。让我们添加源端口和目的端口:
Set UDP RX flow hash fields with
ethtool -N
.使用
ethtool -N
设置 UDP RX 流哈希字段:$ sudo ethtool -N eth0 rx-flow-hash udp4 sdfn
The
sdfn
string is a bit cryptic; check the ethtool
man page for an explanation of each letter.sdfn
这个字符串有点晦涩难懂,有关每个字母的解释,请查看ethtool
的手册页。Adjusting the fields to take a hash on is useful, but
ntuple
filtering is even more useful for finer grained control over which flows will be handled by which RX queue.调整用于哈希计算的字段很有用,但
ntuple
过滤对于更精细地控制哪些流由哪个 RX 队列处理更为有用。ntuple filtering for steering network flows(ntuple 过滤以引导网络流)
Some NICs support a feature known as “ntuple filtering.” This feature allows the user to specify (via
ethtool
) a set of parameters to use to filter incoming network data in hardware and queue it to a particular RX queue. For example, the user can specify that TCP packets destined to a particular port should be sent to RX queue 1.一些 NIC 支持一种称为 “ntuple 过滤” 的功能。此功能允许用户通过
ethtool
指定一组参数,在硬件中对传入的网络数据进行过滤,并将其排队到特定的 RX 队列。例如,用户可以指定目标端口为特定端口的 TCP 数据包应发送到 RX 队列 1。On Intel NICs this feature is commonly known as Intel Ethernet Flow Director. Other NIC vendors may have other marketing names for this feature.
在英特尔 NIC 上,此功能通常称为英特尔以太网流导向器。其他 NIC 供应商可能对此功能有不同的营销名称。
As we’ll see later, ntuple filtering is a crucial component of another feature called Accelerated Receive Flow Steering (aRFS), which makes using ntuple much easier if your NIC supports it. aRFS will be covered later.
正如我们稍后将看到的,ntuple 过滤是另一个称为加速接收流导向(aRFS)的功能的关键组成部分,如果你的 NIC 支持 aRFS,它会使使用 ntuple 变得更加容易。稍后将介绍 aRFS。
This feature can be useful if the operational requirements of the system involve maximizing data locality with the hope of increasing CPU cache hit rates when processing network data. For example consider the following configuration for a webserver running on port 80:
如果系统的操作要求涉及最大化数据局部性,以期在处理网络数据时提高 CPU 缓存命中率,那么此功能会很有用。例如,考虑在端口 80 上运行的 Web 服务器的以下配置:
- A webserver running on port 80 is pinned to run on CPU 2.
- IRQs for an RX queue are assigned to be processed by CPU 2.
- TCP traffic destined to port 80 is ‘filtered’ with ntuple to CPU 2.
- All incoming traffic to port 80 is then processed by CPU 2 starting at data arrival to the userland program.
- Careful monitoring of the system including cache hit rates and networking stack latency will be needed to determine effectiveness.
- 在端口 80 上运行的 Web 服务器被绑定到 CPU 2 上运行。
- RX 队列的 IRQ 被分配由 CPU 2 处理。
- 目标端口为 80 的 TCP 流量通过 ntuple “过滤” 到 CPU 2。
- 从数据到达用户态程序开始,所有发往端口 80 的传入流量都由 CPU 2 处理。
- 需要仔细监控系统,包括缓存命中率和网络栈延迟,以确定其有效性。
As mentioned, ntuple filtering can be configured with
ethtool
, but first, you’ll need to ensure that this feature is enabled on your device.如前所述,可以使用
ethtool
配置 ntuple 过滤,但首先,你需要确保设备上启用了此功能。Check if ntuple filters are enabled with
ethtool -k
使用
ethtool -k
检查 ntuple 过滤器是否启用:$ sudo ethtool -k eth0 Offload parameters for eth0: ... ntuple-filters: off receive-hashing: on
As you can see,
ntuple-filters
are set to off on this device.如你所见,此设备上的
ntuple-filters
设置为关闭。Enable ntuple filters with
ethtool -K
使用
ethtool -K
启用 ntuple 过滤器:$ sudo ethtool -K eth0 ntuple on
Once you’ve enabled ntuple filters, or verified that it is enabled, you can check the existing ntuple rules by using
ethtool
:启用 ntuple 过滤器后,或者确认其已启用后,可以使用
ethtool
检查现有的 ntuple 规则:Check existing ntuple filters with
ethtool -u
使用
ethtool -u
检查现有的 ntuple 过滤器:$ sudo ethtool -u eth0 40 RX rings available Total 0 rules
As you can see, this device has no ntuple filter rules. You can add a rule by specifying it on the command line to
ethtool
. Let’s add a rule to direct all TCP traffic with a destination port of 80 to RX queue 2:如你所见,此设备没有 ntuple 过滤规则。你可以在命令行中向
ethtool
指定规则来添加一个。让我们添加一个规则,将所有目标端口为 80 的 TCP 流量定向到 RX 队列 2:Add ntuple filter to send TCP flows with destination port 80 to RX queue 2
$ sudo ethtool -U eth0 flow-type tcp4 dst-port 80 action 2
You can also use ntuple filtering to drop packets for particular flows at the hardware level. This can be useful for mitigating heavy incoming traffic from specific IP addresses. For more information about configuring ntuple filter rules, see the
ethtool
man page.你还可以使用 ntuple 过滤在硬件级别丢弃特定流的数据包。这对于减轻来自特定 IP 地址的大量传入流量很有用。有关配置 ntuple 过滤规则的更多信息,请查看
ethtool
的手册页。You can usually get statistics about the success (or failure) of your ntuple rules by checking values output from
ethtool -S [device name]
. For example, on Intel NICs, the statistics fdir_match
and fdir_miss
calculate the number of matches and misses for your ntuple filtering rules. Consult your device driver source and device data sheet for tracking down statistics counters (if available).通常,你可以通过检查
ethtool -S [设备名称]
输出的值来获取 ntuple 规则成功(或失败)的统计信息。例如,在英特尔 NIC 上,fdir_match
和fdir_miss
统计信息计算 ntuple 过滤规则的匹配次数和未匹配次数。查阅设备驱动程序源代码和设备数据表,以查找统计计数器(如果可用)。SoftIRQs(软中断)
Before examining the network stack, we’ll need to take a short detour to examine something in the Linux kernel called SoftIRQs.
在研究网络栈之前,我们需要先绕个小弯,研究一下 Linux 内核中一个叫做软中断(SoftIRQs)的东西。
What is a softirq?(什么是软中断?)
The softirq system in the Linux kernel is a mechanism for executing code outside of the context of an interrupt handler implemented in a driver. This system is important because hardware interrupts may be disabled during all or part of the execution of an interrupt handler. The longer interrupts are disabled, the greater chance that events may be missed. So, it is important to defer any long running actions outside of the interrupt handler so that it can complete as quickly as possible and re-enable interrupts from the device.
Linux 内核中的软中断系统是一种在驱动程序实现的中断处理程序上下文之外执行代码的机制。这个系统很重要,因为在中断处理程序执行的全部或部分时间内,硬件中断可能会被禁用。中断被禁用的时间越长,错过事件的可能性就越大。因此,将任何长时间运行的操作推迟到中断处理程序之外执行非常重要,这样中断处理程序就可以尽快完成,并重新启用设备的中断。
There are other mechanisms that can be used for deferring work in the kernel, but for the purposes of the networking stack, we’ll be looking at softirqs.
在内核中还有其他机制可用于推迟工作,但就网络栈而言,我们将关注软中断。
The softirq system can be imagined as a series of kernel threads (one per CPU) that run handler functions which have been registered for different softirq events. If you’ve ever looked at top and seen
ksoftirqd/0
in the list of kernel threads, you were looking at the softirq kernel thread running on CPU 0.软中断系统可以想象为一系列内核线程(每个 CPU 一个),它们运行针对不同软中断事件注册的处理函数。如果你曾经在
top
命令中看到ksoftirqd/0
在内核线程列表中,那么你看到的就是在 CPU 0 上运行的软中断内核线程。Kernel subsystems (like networking) can register a softirq handler by executing the
open_softirq
function. We’ll see later how the networking system registers its softirq handlers. For now, let’s learn a bit more about how softirqs work.内核子系统(如网络子系统)可以通过执行
open_softirq
函数注册一个软中断处理程序。稍后我们将看到网络系统如何注册其软中断处理程序。现在,让我们进一步了解软中断的工作原理。ksoftirqd
Since softirqs are so important for deferring the work of device drivers, you might imagine that the
ksoftirqd
process is spawned pretty early in the life cycle of the kernel and you’d be correct.由于软中断对于推迟设备驱动程序的工作非常重要,你可能会认为
ksoftirqd
进程在内核生命周期的早期就会被创建,你想得没错。Looking at the code found in kernel/softirq.c reveals how the
ksoftirqd
system is initialized:查看
kernel/softirq.c
中的代码,可以了解ksoftirqd
系统是如何初始化的:static struct smp_hotplug_thread softirq_threads = { .store = &ksoftirqd, .thread_should_run = ksoftirqd_should_run, .thread_fn = run_ksoftirqd, .thread_comm = "ksoftirqd/%u", }; static __init int spawn_ksoftirqd(void) { register_cpu_notifier(&cpu_nfb); BUG_ON(smpboot_register_percpu_thread(&softirq_threads)); return 0; } early_initcall(spawn_ksoftirqd);
As you can see from the
struct smp_hotplug_thread
definition above, there are two function pointers being registered: ksoftirqd_should_run
and run_ksoftirqd
.从上面的
struct smp_hotplug_thread
定义中可以看到,有两个函数指针被注册:ksoftirqd_should_run
和run_ksoftirqd
。Both of these functions are called from kernel/smpboot.c as part of something which resembles an event loop.
这两个函数都在内核的
smpboot.c
中被调用,作为类似事件循环的一部分。The code in
kernel/smpboot.c
first calls ksoftirqd_should_run
which determines if there are any pending softirqs and, if there are pending softirqs, run_ksoftirqd
is executed. The run_ksoftirqd
does some minor bookkeeping before it calls __do_softirq
.smpboot.c
中的代码首先调用ksoftirqd_should_run
,它会确定是否有任何挂起的软中断,如果有,则执行run_ksoftirqd
。run_ksoftirqd
在调用__do_softirq
之前会进行一些小的簿记工作。__do_softirq
The
__do_softirq
function does a few interesting things:- determines which softirq is pending
- softirq time is accounted for statistics purposes
- softirq execution statistics are incremented
- the softirq handler for the pending softirq (which was registered with a call to
open_softirq
) is executed.
_do_softirq
函数执行了一些有趣的操作:- 确定哪个软中断处于挂起状态。
- 统计软中断时间,用于统计目的。
- 增加软中断执行统计信息。
- 执行针对挂起软中断注册的软中断处理程序(通过调用
open_softirq
注册)。
So, when you look at graphs of CPU usage and see
softirq
or si
you now know that this is measuring the amount of CPU usage happening in a deferred work context.所以,当你查看 CPU 使用情况图表并看到
softirq
或si
时,现在你知道这是在测量在推迟工作上下文中发生的 CPU 使用量。Monitoring(监控)
/proc/softirqs
The
softirq
system increments statistic counters which can be read from /proc/softirqs
Monitoring these statistics can give you a sense for the rate at which softirqs for various events are being generated.软中断系统会增加统计计数器,可以从
/proc/softirqs
读取这些计数器。监控这些统计信息可以让你了解各种事件的软中断生成速率。Check softIRQ stats by reading
/proc/softirqs
.通过读取
/proc/softirqs
检查软中断统计信息:$ cat /proc/softirqs CPU0 CPU1 CPU2 CPU3 HI: 0 0 0 0 TIMER: 2831512516 1337085411 1103326083 1423923272 NET_TX: 15774435 779806 733217 749512 NET_RX: 1671622615 1257853535 2088429526 2674732223 BLOCK: 1800253852 1466177 1791366 634534 BLOCK_IOPOLL: 0 0 0 0 TASKLET: 25 0 0 0 SCHED: 2642378225 1711756029 629040543 682215771 HRTIMER: 2547911 2046898 1558136 1521176 RCU: 2056528783 4231862865 3545088730 844379888
This file can give you an idea of how your network receive (
NET_RX
) processing is currently distributed across your CPUs. If it is distributed unevenly, you will see a larger count value for some CPUs than others. This is one indicator that you might be able to benefit from Receive Packet Steering / Receive Flow Steering described below. Be careful using just this file when monitoring your performance: during periods of high network activity you would expect to see the rate NET_RX
increments increase, but this isn’t necessarily the case. It turns out that this is a bit nuanced, because there are additional tuning knobs in the network stack that can affect the rate at which NET_RX
softirqs will fire, which we’ll see soon.这个文件可以让你了解网络接收(
NET_RX
)处理当前在 CPU 之间的分布情况。如果分布不均匀,你会看到某些 CPU 的计数值比其他 CPU 大。这是一个指标,表明你可能会从下面描述的接收数据包导向 / 接收流导向中受益。在监控性能时,仅使用这个文件要小心:在网络活动高峰期,你可能期望看到NET_RX
的增量速率增加,但实际情况并非一定如此。事实证明,这有点微妙,因为网络栈中有其他调整旋钮会影响NET_RX
软中断的触发速率,我们很快就会看到。You should be aware of this, however, so that if you adjust the other tuning knobs you will know to examine
/proc/softirqs
and expect to see a change.不过,你应该意识到这一点,这样在调整其他调整旋钮时,你就会知道检查
/proc/softirqs
,并期望看到变化。Now, let’s move on to the networking stack and trace how network data is received from top to bottom.
现在,让我们进入网络栈,跟踪网络数据从顶层到底层的接收过程。
Linux network device subsystem(Linux 网络设备子系统)
Now that we’ve taken a look in to how network drivers and softirqs work, let’s see how the Linux network device subsystem is initialized. Then, we can follow the path of a packet starting with its arrival.
现在我们已经了解了网络驱动程序和软中断的工作原理,让我们看看 Linux 网络设备子系统是如何初始化的。然后,我们可以跟踪数据包从到达开始的路径。
Initialization of network device subsystem(网络设备子系统的初始化)
The network device (netdev) subsystem is initialized in the function
net_dev_init
. Lots of interesting things happen in this initialization function.网络设备(netdev)子系统在
net_dev_init
函数中初始化。在这个初始化函数中发生了很多有趣的事情。Initialization of struct softnet_data
structures(struct softnet_data 结构的初始化)
net_dev_init
creates a set of struct softnet_data
structures for each CPU on the system. These structures will hold pointers to several important things for processing network data:net_dev_init
为系统中的每个 CPU 创建一组struct softnet_data
结构。这些结构将保存处理网络数据所需的几个重要指针:- List for NAPI structures to be registered to this CPU.
- A backlog for data processing.
- The processing
weight
.
- The receive offload structure list.
- Receive packet steering settings.
- And more.
- 要注册到这个 CPU 的 NAPI 结构列表。
- 数据处理的积压队列。
- 处理
weight
。
- 接收卸载结构列表。
- 接收数据包导向设置。
- 还有更多。
Each of these will be examined in greater detail later as we progress up the stack.
随着我们在栈中向上推进,后面将更详细地研究这些内容。
Initialization of softirq handlers(软中断处理程序的初始化)
net_dev_init
registers a transmit and receive softirq handler which will be used to process incoming or outgoing network data. The code for this is pretty straight forward:net_dev_init
注册一个传输和接收软中断处理程序,用于处理传入或传出的网络数据。相关代码非常直接:static int __init net_dev_init(void) { /* ... */ open_softirq(NET_TX_SOFTIRQ, net_tx_action); open_softirq(NET_RX_SOFTIRQ, net_rx_action); /* ... */ }
We’ll see soon how the driver’s interrupt handler will “raise” (or trigger) the
net_rx_action
function registered to the NET_RX_SOFTIRQ
softirq.我们很快就会看到驱动程序的中断处理程序如何 “触发”(或调用)注册到
NET_RX_SOFTIRQ
软中断的net_rx_action
函数。Data arrives(数据到达)
At long last; network data arrives!
终于,网络数据到达了!
Assuming that the RX queue has enough available descriptors, the packet is written to RAM via DMA. The device then raises the interrupt that is assigned to it (or in the case of MSI-X, the interrupt tied to the rx queue the packet arrived on).
假设 RX 队列有足够的可用描述符,数据包将通过 DMA 写入内存。然后设备会发出分配给它的中断(在 MSI-X 的情况下,是与数据包到达的 rx 队列相关联的中断)。
Interrupt handler(中断处理程序)
In general, the interrupt handler which runs when an interrupt is raised should try to defer as much processing as possible to happen outside the interrupt context. This is crucial because while an interrupt is being processed, other interrupts may be blocked.
一般来说,当一个中断被触发时运行的中断处理程序应该尽量将尽可能多的处理工作推迟到中断上下文之外进行。这一点至关重要,因为在处理一个中断时,其他中断可能会被阻塞。
Let’s take a look at the source for the MSI-X interrupt handler; it will really help illustrate the idea that the interrupt handler does as little work as possible.
让我们看一下 MSI-X 中断处理程序的源代码,这将真正有助于说明中断处理程序尽量少做工作的理念。在
drivers/net/ethernet/intel/igb/igb_main.c
中:static irqreturn_t igb_msix_ring(int irq, void *data) { struct igb_q_vector *q_vector = data; /* Write the ITR value calculated from the previous interrupt. */ igb_write_itr(q_vector); napi_schedule(&q_vector->napi); return IRQ_HANDLED; }
This interrupt handler is very short and performs 2 very quick operations before returning.
这个中断处理程序非常简短,在返回之前执行了两个非常快速的操作。
First, this function calls
igb_write_itr
which simply updates a hardware specific register. In this case, the register that is updated is one which is used to track the rate hardware interrupts are arriving.首先,这个函数调用
igb_write_itr
,它只是更新一个特定于硬件的寄存器。在这种情况下,更新的寄存器用于跟踪硬件中断的到达速率。This register is used in conjunction with a hardware feature called “Interrupt Throttling” (also called “Interrupt Coalescing”) which can be used to to pace the delivery of interrupts to the CPU. We’ll see soon how
ethtool
provides a mechanism for adjusting the rate at which IRQs fire.这个寄存器与一种称为 “中断节流”(也称为 “中断合并”)的硬件功能结合使用,可用于控制中断向 CPU 的传递速率。我们很快就会看到
ethtool
如何提供一种调整 IRQ 触发速率的机制。Secondly,
napi_schedule
is called which wakes up the NAPI processing loop if it was not already active. Note that the NAPI processing loop executes in a softirq; the NAPI processing loop does not execute from the interrupt handler. The interrupt handler simply causes it to start executing if it was not already.其次,调用
napi_schedule
,如果 NAPI 处理循环尚未激活,它将唤醒该循环。请注意,NAPI 处理循环在软中断中执行,而不是在中断处理程序中执行。中断处理程序只是在 NAPI 处理循环未运行时使其开始执行。The actual code showing exactly how this works is important; it will guide our understanding of how network data is processed on multi-CPU systems.
实际展示这一过程的代码很重要,它将指导我们理解在多 CPU 系统中网络数据是如何处理的。
NAPI and napi_schedule
Let’s figure out how the
napi_schedule
call from the hardware interrupt handler works.让我们弄清楚硬件中断处理程序中的
napi_schedule
调用是如何工作的。Remember, NAPI exists specifically to harvest network data without needing interrupts from the NIC to signal that data is ready for processing. As mentioned earlier, the NAPI
poll
loop is bootstrapped by receiving a hardware interrupt. In other words: NAPI is enabled, but off, until the first packet arrives at which point the NIC raises an IRQ and NAPI is started. There are a few other cases, as we’ll see soon, where NAPI can be disabled and will need a hardware interrupt to be raised before it will be started again.请记住,NAPI 的存在是为了在不需要 NIC 发出中断信号来表明数据已准备好处理的情况下收集网络数据。如前所述,NAPI 的
poll
循环是由接收硬件中断启动的。换句话说,NAPI 已启用但处于关闭状态,直到第一个数据包到达,此时 NIC 发出 IRQ,NAPI 才会启动。还有其他一些情况,我们很快就会看到,NAPI 可能会被禁用,并且需要硬件中断才能再次启动。The NAPI poll loop is started when the interrupt handler in the driver calls
napi_schedule
. napi_schedule
is actually just a wrapper function defined in a header file which calls down to __napi_schedule
.当驱动程序中的中断处理程序调用
napi_schedule
时,NAPI 轮询循环就会启动。napi_schedule
实际上只是一个在头文件中定义的包装函数,它会调用__napi_schedule
。在net/core/dev.c
中:From net/core/dev.c:
/** * __napi_schedule - schedule for receive * @n: entry to schedule * * The entry's receive function will be scheduled to run */ void __napi_schedule(struct napi_struct *n) { unsigned long flags; local_irq_save(flags); ____napi_schedule(&__get_cpu_var(softnet_data), n); local_irq_restore(flags); } EXPORT_SYMBOL(__napi_schedule);
This code is using
__get_cpu_var
to get the softnet_data
structure that is registered to the current CPU. This softnet_data
structure and the struct napi_struct
structure handed up from the driver are passed into ____napi_schedule
. Wow, that’s a lot of underscores ;)这段代码使用
__get_cpu_var
获取注册到当前 CPU 的softnet_data
结构。这个softnet_data
结构和从驱动程序传递上来的struct napi_struct
结构被传递给____napi_schedule
。哇,好多下划线呢!Let’s take a look at
____napi_schedule
, from net/core/dev.c:让我们看一下
____napi_schedule
,在net/core/dev.c
中:/* Called with irq disabled */ static inline void ____napi_schedule(struct softnet_data *sd, struct napi_struct *napi) { list_add_tail(&napi->poll_list, &sd->poll_list); __raise_softirq_irqoff(NET_RX_SOFTIRQ); }
This code does two important things:
这段代码做了两件重要的事情:
- The
struct napi_struct
handed up from the device driver’s interrupt handler code is added to thepoll_list
attached to thesoftnet_data
structure associated with the current CPU.
- 从设备驱动程序的中断处理程序代码传递上来的
struct napi_struct
被添加到与当前 CPU 相关联的softnet_data
结构的poll_list
中。
__raise_softirq_irqoff
is used to “raise” (or trigger) a NET_RX_SOFTIRQ softirq. This will cause thenet_rx_action
registered during the network device subsystem initialization to be executed, if it’s not currently being executed.
- 使用
__raise_softirq_irqoff
“触发”(或调用)一个NET_RX_SOFTIRQ
软中断。这将导致在网络设备子系统初始化期间注册的net_rx_action
被执行(前提是它当前没有正在执行)。
As we’ll see shortly, the softirq handler function
net_rx_action
will call the NAPI poll
function to harvest packets.正如我们稍后将看到的,软中断处理函数
net_rx_action
将调用 NAPI 的poll
函数来收集数据包。A note about CPU and network data processing(关于 CPU 和网络数据处理的说明)
Note that all the code we’ve seen so far to defer work from a hardware interrupt handler to a softirq has been using structures associated with the current CPU.
请注意,到目前为止我们看到的将工作从硬件中断处理程序推迟到软中断的所有代码,都使用了与当前 CPU 相关联的结构。
While the driver’s IRQ handler itself does very little work itself, the softirq handler will execute on the same CPU as the driver’s IRQ handler.
虽然驱动程序的 IRQ 处理程序本身做的工作很少,但软中断处理程序将在与驱动程序的 IRQ 处理程序相同的 CPU 上执行。
This is why setting which CPU a particular IRQ is handled by is important: that CPU will be used not only to execute the interrupt handler in the driver, but also to harvest packets in a softirq via NAPI.
这就是为什么设置特定 IRQ 将由哪个 CPU 处理很重要的原因:这个 CPU 不仅将用于执行驱动程序中的中断处理程序,还将用于通过 NAPI 在软中断中收集数据包。
As we’ll see later, things like Receive Packet Steering can distribute some of this work to other CPUs further up the network stack.
正如我们稍后将看到的,诸如接收数据包导向(Receive Packet Steering)之类的技术可以将部分工作分配到网络栈中更上层的其他 CPU 上。
Monitoring network data arrival(监控网络数据到达)
Hardware interrupt requests
硬件中断请求
Note: monitoring hardware IRQs does not give a complete picture of packet processing health. Many drivers turn off hardware IRQs while NAPI is running, as we'll see later. It is one important part of your whole monitoring solution.
注意:监控硬件 IRQ 并不能完全反映数据包处理的健康状况。正如我们稍后将看到的,许多驱动程序在 NAPI 运行时会关闭硬件 IRQ。它只是整个监控解决方案的一个重要部分。
Check hardware interrupt stats by reading
/proc/interrupts
.通过读取
/proc/interrupts
检查硬件中断统计信息:$ cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 0: 46 0 0 0 IR-IO-APIC-edge timer 1: 3 0 0 0 IR-IO-APIC-edge i8042 30: 3361234770 0 0 0 IR-IO-APIC-fasteoi aacraid 64: 0 0 0 0 DMAR_MSI-edge dmar0 65: 1 0 0 0 IR-PCI-MSI-edge eth0 66: 863649703 0 0 0 IR-PCI-MSI-edge eth0-TxRx-0 67: 986285573 0 0 0 IR-PCI-MSI-edge eth0-TxRx-1 68: 45 0 0 0 IR-PCI-MSI-edge eth0-TxRx-2 69: 394 0 0 0 IR-PCI-MSI-edge eth0-TxRx-3 NMI: 9729927 4008190 3068645 3375402 Non-maskable interrupts LOC: 2913290785 1585321306 1495872829 1803524526 Local timer interrupts
You can monitor the statistics in
/proc/interrupts
to see how the number and rate of hardware interrupts change as packets arrive and to ensure that each RX queue for your NIC is being handled by an appropriate CPU. As we’ll see shortly, this number only tells us how many hardware interrupts have happened, but it is not necessarily a good metric for understanding how much data has been received or processed as many drivers will disable NIC IRQs as part of their contract with the NAPI subsystem. Further, using interrupt coalescing will also affect the statistics gathered from this file. Monitoring this file can help you determine if the interrupt coalescing settings you select are actually working.你可以监控
/proc/interrupts
中的统计信息,以查看随着数据包的到达,硬件中断的数量和速率如何变化,并确保 NIC 的每个 RX 队列都由适当的 CPU 处理。正如我们稍后将看到的,这个数字仅告诉我们发生了多少硬件中断,但它不一定是了解接收或处理了多少数据的好指标,因为许多驱动程序会作为与 NAPI 子系统的约定的一部分禁用 NIC IRQ。此外,使用中断合并也会影响从这个文件中收集的统计信息。监控这个文件可以帮助你确定选择的中断合并设置是否真正有效。To get a more complete picture of your network processing health, you’ll need to monitor
/proc/softirqs
(as mentioned above) and additional files in /proc
that we’ll cover below.为了更全面地了解网络处理的健康状况,你需要监控
/proc/softirqs
(如上文所述)以及下面我们将介绍的/proc
中的其他文件。Tuning network data arrival(调整网络数据到达)
Interrupt coalescing(中断合并)
Interrupt coalescing is a method of preventing interrupts from being raised by a device to a CPU until a specific amount of work or number of events are pending.
中断合并是一种防止设备向 CPU 发出中断的方法,直到有特定数量的工作或事件等待处理。
This can help prevent interrupt storms and, depending on the settings used, can help increase throughput or reduce latency. Generating fewer interrupts results in higher throughput, increased latency, and lower CPU usage. Generating more interrupts results in the opposite: lower latency and lower throughput, but also increased CPU usage.
这有助于防止中断风暴,并且根据所使用的设置,还可以帮助提高吞吐量或降低延迟。生成的中断越少,吞吐量越高,延迟增加,CPU 使用率越低;生成的中断越多,情况则相反:延迟降低,吞吐量降低,但 CPU 使用率也会增加。
Historically, earlier versions of the
igb
, e1000
, and other drivers included support for a parameter called InterruptThrottleRate
. This parameter has been replaced in more recent drivers with a generic ethtool
function.从历史上看,早期版本的
igb
、e1000
和其他驱动程序包含一个名为InterruptThrottleRate
的参数支持。在更新的驱动程序中,这个参数已被一个通用的ethtool
函数取代。Get the current IRQ coalescing settings with
ethtool -c
.使用
ethtool -c
获取当前的 IRQ 合并设置:$ sudo ethtool -c eth0 Coalesce parameters for eth0: Adaptive RX: off TX: off stats-block-usecs: 0 sample-interval: 0 pkt-rate-low: 0 pkt-rate-high: 0 ...
ethtool
provides a generic interface for setting various coalescing settings. Keep in mind, however, that not every device or driver will support every setting. You should check your driver documentation or driver source code to determine what is, or is not, supported. As per the ethtool documentation: “Anything not implemented by the driver causes these values to be silently ignored.”ethtool
提供了一个通用接口来设置各种合并设置。然而,请记住,并非每个设备或驱动程序都支持所有设置。你应该查看驱动程序文档或驱动程序源代码,以确定哪些设置受支持,哪些不受支持。根据ethtool
文档:“任何未被驱动程序实现的设置都会被静默忽略。”One interesting option that some drivers support is “adaptive RX/TX IRQ coalescing.” This option is typically implemented in hardware. The driver usually needs to do some work to inform the NIC that this feature is enabled and some bookkeeping as well (as seen in the
igb
driver code above).一些驱动程序支持的一个有趣选项是 “自适应 RX/TX IRQ 合并”。这个选项通常在硬件中实现。驱动程序通常需要做一些工作来通知 NIC 启用此功能,并进行一些簿记工作(如上面
igb
驱动程序代码中所示)。The result of enabling adaptive RX/TX IRQ coalescing is that interrupt delivery will be adjusted to improve latency when packet rate is low and also improve throughput when packet rate is high.
启用自适应 RX/TX IRQ 合并的结果是,在数据包速率较低时,中断传递将被调整以改善延迟;在数据包速率较高时,则会提高吞吐量。
Enable adaptive RX IRQ coalescing with
ethtool -C
使用
ethtool -C
启用自适应 RX IRQ 合并:$ sudo ethtool -C eth0 adaptive-rx on
You can also use
ethtool -C
to set several options. Some of the more common options to set are:你还可以使用
ethtool -C
设置多个选项。一些比较常见的可设置选项包括:rx-usecs
: How many usecs to delay an RX interrupt after a packet arrives.
rx-frames
: Maximum number of data frames to receive before an RX interrupt.
rx-usecs-irq
: How many usecs to delay an RX interrupt while an interrupt is being serviced by the host.
rx-frames-irq
: Maximum number of data frames to receive before an RX interrupt is generated while the system is servicing an interrupt.
And many, many more.
rx-usecs
:数据包到达后,延迟 RX 中断的微秒数。
rx-frames
:在产生 RX 中断之前,接收的数据帧的最大数量。
rx-usecs-irq
:在主机处理中断时,延迟 RX 中断的微秒数。
rx-frames-irq
:在系统处理中断时,产生 RX 中断之前,接收的数据帧的最大数量。
还有很多其他选项。
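As an illustration only (not a recommendation, and subject to what your driver actually implements), you could combine two of the options listed above to ask the NIC to wait up to 50 microseconds or 64 frames, whichever comes first, before raising an RX interrupt, and then read the settings back:
$ sudo ethtool -C eth0 rx-usecs 50 rx-frames 64
$ sudo ethtool -c eth0 | grep -E 'rx-usecs:|rx-frames:'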
Reminder that your hardware and driver may only support a subset of the options listed above. You should consult your driver source code and your hardware data sheet for more information on supported coalescing options.
请记住,你的硬件和驱动程序可能仅支持上述选项的一个子集。你应该查阅驱动程序源代码和硬件数据表,以获取有关支持的合并选项的更多信息。
Unfortunately, the options you can set aren’t well documented anywhere except in a header file. Check the source of include/uapi/linux/ethtool.h to find an explanation of each option supported by
ethtool
(but not necessarily your driver and NIC).不幸的是,除了在一个头文件中,你可以设置的选项并没有很好的文档说明。查看
include/uapi/linux/ethtool.h
的源代码,以找到ethtool
支持的每个选项的解释(但不一定适用于你的驱动程序和 NIC)。Note: while interrupt coalescing seems to be a very useful optimization at first glance, the rest of the networking stack internals also come into the fold when attempting to optimize. Interrupt coalescing can be useful in some cases, but you should ensure that the rest of your networking stack is also tuned properly. Simply modifying your coalescing settings alone will likely provide minimal benefit in and of itself.
注意:虽然乍一看中断合并似乎是一种非常有用的优化,但在尝试优化时,网络栈的其他内部机制也会起作用。中断合并在某些情况下可能有用,但你应该确保网络栈的其他部分也进行了适当的调整。仅仅修改合并设置本身可能只会带来最小的好处。
Adjusting IRQ affinities(调整 IRQ 亲和力)
If your NIC supports RSS / multiqueue or if you are attempting to optimize for data locality, you may wish to use a specific set of CPUs for handling interrupts generated by your NIC.
如果你的 NIC 支持 RSS / 多队列功能,或者你试图优化数据局部性,你可能希望使用一组特定的 CPU 来处理 NIC 生成的中断。
Setting specific CPUs allows you to segment which CPUs will be used for processing which IRQs. These changes may affect how upper layers operate, as we’ve seen for the networking stack.
设置特定的 CPU 可以让你划分哪些 CPU 将用于处理哪些 IRQ。这些更改可能会影响上层的操作,就像我们在网络栈中看到的那样。
If you do decide to adjust your IRQ affinities, you should first check if you are running the
irqbalance
daemon. This daemon tries to automatically balance IRQs to CPUs and it may overwrite your settings. If you are running irqbalance
, you should either disable irqbalance
or use the --banirq
in conjunction with IRQBALANCE_BANNED_CPUS
to let irqbalance
know that it shouldn’t touch a set of IRQs and CPUs that you want to assign yourself.如果你决定调整 IRQ 亲和力,首先应该检查是否正在运行
irqbalance
守护进程。这个守护进程试图自动将 IRQ 平衡到各个 CPU 上,它可能会覆盖你的设置。如果你正在运行irqbalance
,你应该要么禁用它,要么使用--banirq
结合IRQBALANCE_BANNED_CPUS
,让irqbalance
知道不要触碰你想要自己分配的一组 IRQ 和 CPU。Next, you should check the file
/proc/interrupts
for a list of the IRQ numbers for each network RX queue for your NIC.接下来,你应该查看
/proc/interrupts
文件,获取 NIC 每个网络 RX 队列的 IRQ 编号列表。Finally, you can adjust the which CPUs each of those IRQs will be handled by modifying
/proc/irq/IRQ_NUMBER/smp_affinity
for each IRQ number.最后,你可以通过修改每个 IRQ 编号对应的
/proc/irq/IRQ_NUMBER/smp_affinity
文件,来调整每个 IRQ 将由哪些 CPU 处理。You simply write a hexadecimal bitmask to this file to instruct the kernel which CPUs it should use for handling the IRQ.
你只需向这个文件写入一个十六进制位掩码,就可以指示内核应该使用哪些 CPU 来处理该 IRQ。
Example: Set the IRQ affinity for IRQ 8 to CPU 0
示例:将 IRQ 8 的 IRQ 亲和力设置为 CPU 0
$ sudo bash -c 'echo 1 > /proc/irq/8/smp_affinity'
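The mask can cover more than one CPU, and each IRQ can get its own mask. For example, using the IRQ numbers 66-69 from the /proc/interrupts sample above (they will be different on your system), you could pin each eth0-TxRx-N queue to its own CPU:
$ sudo bash -c 'echo 1 > /proc/irq/66/smp_affinity'   # eth0-TxRx-0 -> CPU 0
$ sudo bash -c 'echo 2 > /proc/irq/67/smp_affinity'   # eth0-TxRx-1 -> CPU 1
$ sudo bash -c 'echo 4 > /proc/irq/68/smp_affinity'   # eth0-TxRx-2 -> CPU 2
$ sudo bash -c 'echo 8 > /proc/irq/69/smp_affinity'   # eth0-TxRx-3 -> CPU 3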
Network data processing begins(网络数据处理开始)
Once the softirq code determines that a softirq is pending, begins processing, and executes
net_rx_action
, network data processing begins.一旦软中断代码确定有一个软中断处于挂起状态,开始处理并执行
net_rx_action
,网络数据处理就开始了。Let’s take a look at portions of the
net_rx_action
processing loop to understand how it works, which pieces are tunable, and what can be monitored.让我们看一下
net_rx_action
处理循环的部分内容,以了解它是如何工作的、哪些部分是可调整的,以及可以监控哪些内容。net_rx_action
processing loop(net_rx_action 处理循环)
net_rx_action
begins the processing of packets from the memory the packets were DMA’d into by the device.net_rx_action
开始处理设备通过 DMA 传输到内存中的数据包。The function iterates through the list of NAPI structures that are queued for the current CPU, dequeuing each structure, and operating on it.
该函数遍历为当前 CPU 排队的 NAPI 结构列表,将每个结构出队并进行处理。
The processing loop bounds the amount of work and execution time that can be consumed by the registered NAPI
poll
functions. It does this in two ways:处理循环限制了注册的 NAPI
poll
函数可以消耗的工作量和执行时间,它通过两种方式实现这一点:- By keeping track of a work
budget
(which can be adjusted), and
- Checking the elapsed time
- 跟踪一个工作
budget
(可以调整)。
- 检查经过的时间。在
net/core/dev.c
中:
From net/core/dev.c:
while (!list_empty(&sd->poll_list)) { struct napi_struct *n; int work, weight; /* If softirq window is exhausted then punt. * Allow this to run for 2 jiffies since which will allow * an average latency of 1.5/HZ. */ if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit))) goto softnet_break;
This is how the kernel prevents packet processing from consuming the entire CPU. The
budget
above is the total available budget that will be spent among each of the available NAPI structures registered to this CPU.这就是内核防止数据包处理占用整个 CPU 的方式。上面的
budget
是分配给当前 CPU 上每个可用 NAPI 结构的总可用预算。This is another reason why multiqueue NICs should have the IRQ affinity carefully tuned. Recall that the CPU which handles the IRQ from the device will be the CPU where the softirq handler will execute and, as a result, will also be the CPU where the above loop and budget computation runs.
这也是为什么具有多队列的 NIC 应该仔细调整 IRQ 亲和力的另一个原因。回想一下,处理设备 IRQ 的 CPU 将是软中断处理程序执行的 CPU,因此,也是上述循环和预算计算运行的 CPU。
Systems with multiple NICs each with multiple queues can end up in a situation where multiple NAPI structs are registered to the same CPU. Data processing for all NAPI structs on the same CPU spends from the same
budget
.具有多个 NIC 且每个 NIC 都有多个队列的系统可能会出现多个 NAPI 结构注册到同一个 CPU 的情况。同一 CPU 上所有 NAPI 结构的数据处理都从相同的
budget
中消耗资源。If you don’t have enough CPUs to distribute your NIC’s IRQs, you can consider increasing the
net_rx_action
budget
to allow for more packet processing for each CPU. Increasing the budget will increase CPU usage (specifically sitime
or si
in top
or other programs), but should reduce latency as data will be processed more promptly.如果你没有足够的 CPU 来分配 NIC 的 IRQ,你可以考虑增加
net_rx_action
的budget
,以便每个 CPU 可以处理更多的数据包。增加预算会增加 CPU 使用率(特别是在top
或其他程序中的si
时间或si
字段),但应该会减少延迟,因为数据将被更及时地处理。Note: the CPU will still be bounded by a time limit of 2 jiffies, regardless of the assigned budget.
注意:无论分配的预算是多少,CPU 仍然会受到 2 个 jiffies 的时间限制。
NAPI poll
function and weight
(NAPI poll 函数和 weight)
Recall that network device drivers use
netif_napi_add
for registering poll
function. As we saw earlier in this post, the igb
driver has a piece of code like this:回想一下,网络设备驱动程序使用
netif_napi_add
注册poll
函数。正如我们在本文前面看到的,igb
驱动程序有类似这样的代码:/* initialize NAPI */ netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);
This registers a NAPI structure with a hardcoded weight of 64. We’ll see now how this is used in the
net_rx_action
processing loop.这会注册一个 NAPI 结构,其权重被硬编码为 64。我们现在将看看这个权重在
net_rx_action
处理循环中是如何使用的。在net/core/dev.c
中:From net/core/dev.c:
weight = n->weight; work = 0; if (test_bit(NAPI_STATE_SCHED, &n->state)) { work = n->poll(n, weight); trace_napi_poll(n); } WARN_ON_ONCE(work > weight); budget -= work;
This code obtains the weight which was registered to the NAPI struct (
64
in the above driver code) and passes it into the poll
function which was also registered to the NAPI struct (igb_poll
in the above code).这段代码获取注册到 NAPI 结构的权重(在上面的驱动程序代码中为 64),并将其传递给也注册到该 NAPI 结构的
poll
函数(在上面的代码中为igb_poll
)。The
poll
function returns the number of data frames that were processed. This amount is saved above as work
, which is then subtracted from the overall budget
.poll
函数返回处理的数据帧数量,这个数量在上面保存为work
,然后从总budget
中减去。So, assuming:
- You are using a weight of
64
from your driver (all drivers were hardcoded with this value in Linux 3.13.0), and
- You have your
budget
set to the default of300
Your system would stop processing data when either:
- The
igb_poll
function was called at most 5 times (less if no data to process as we’ll see next), OR
- At least 2 jiffies of time have elapsed.
所以,假设:
- 你使用的驱动程序中的权重为 64(在 Linux 3.13.0 中,所有驱动程序都将此值硬编码为此)。
- 你的
budget
设置为默认的 300。
你的系统将在以下两种情况之一停止处理数据:
igb_poll
函数最多被调用 5 次(如果没有数据要处理,调用次数会更少,我们接下来会看到)。
- 至少经过了 2 个 jiffies 的时间。
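If you want to confirm the budget half of this calculation on your own machine, the sysctl covered in the tuning section further down can simply be read (the driver weight is compiled into the driver and is not exposed this way):
$ sysctl net.core.netdev_budget
net.core.netdev_budget = 300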
The NAPI / network device driver contract(NAPI / 网络设备驱动程序约定)
One important piece of information about the contract between the NAPI subsystem and device drivers which has not been mentioned yet are the requirements around shutting down NAPI.
关于 NAPI 子系统和设备驱动程序之间的约定,有一个重要信息之前尚未提及,那就是关于关闭 NAPI 的要求。
This part of the contract is as follows:
这个约定的这部分内容如下:
- If a driver’s
poll
function consumes its entire weight (which is hardcoded to64
) it must NOT modify NAPI state. Thenet_rx_action
loop will take over.
- If a driver’s
poll
function does NOT consume its entire weight, it must disable NAPI. NAPI will be re-enabled next time an IRQ is received and the driver’s IRQ handler callsnapi_schedule
.
- 如果驱动程序的
poll
函数消耗了其全部权重(硬编码为 64),它不能修改 NAPI 状态,net_rx_action
循环将接管。
- 如果驱动程序的
poll
函数没有消耗其全部权重,它必须禁用 NAPI。下次接收到 IRQ 并且驱动程序的 IRQ 处理程序调用napi_schedule
时,NAPI 将重新启用。
We’ll see how
net_rx_action
deals with the first part of that contract now. Next, the poll
function is examined, we’ll see how the second part of that contract is handled.我们现在将看看
net_rx_action
如何处理该约定的第一部分。接下来,在检查poll
函数时,我们将看到该约定的第二部分是如何处理的。Finishing the net_rx_action
loop(完成 net_rx_action 循环)
The
net_rx_action
processing loop finishes up with one last section of code that deals with the first part of the NAPI contract explained in the previous section. From net/core/dev.c:net_rx_action
处理循环以最后一段代码结束,这段代码处理上一节中解释的 NAPI 约定的第一部分。在net/core/dev.c
中:/* Drivers must not modify the NAPI state if they * consume the entire weight. In such cases this code * still "owns" the NAPI instance and therefore can * move the instance around on the list at-will. */ if (unlikely(work == weight)) { if (unlikely(napi_disable_pending(n))) { local_irq_enable(); napi_complete(n); local_irq_disable(); } else { if (n->gro_list) { /* flush too old packets * If HZ < 1000, flush all packets. */ local_irq_enable(); napi_gro_flush(n, HZ >= 1000); local_irq_disable(); } list_move_tail(&n->poll_list, &sd->poll_list); } }
If the poll function consumed its entire weight, there are two cases that
net_rx_action
handles:如果 poll 函数消耗了其全部权重,
net_rx_action
会处理两种情况:- The network device should be shutdown (e.g. because the user ran
ifconfig eth0 down
), - 网络设备应该关闭(例如,因为用户运行了
ifconfig eth0 down
命令)。
- If the device is not being shutdown, check if there’s a generic receive offload (GRO) list. If the timer tick rate (HZ) is >= 1000, GRO’d network flows that have not been updated in the last jiffy are flushed; if HZ < 1000, all GRO’d flows are flushed. We’ll dig into GRO in detail later. Move the NAPI structure to the end of the list for this CPU so the next iteration of the loop will get the next NAPI structure registered.
- 如果设备没有关闭,检查是否有通用接收卸载(GRO)列表。如果定时器滴答率(HZ)>= 1000,最近一个 jiffy 内没有更新过的 GRO 网络流将被刷新;如果 HZ < 1000,则刷新所有 GRO 网络流。我们稍后会深入研究 GRO。将 NAPI 结构移动到该 CPU 列表的末尾,以便循环的下一次迭代可以获取注册的下一个 NAPI 结构。
And that is how the packet processing loop invokes the driver’s registered
poll
function to process packets. As we’ll see shortly, the poll
function will harvest network data and send it up the stack to be processed.这就是数据包处理循环调用驱动程序注册的
poll
函数来处理数据包的方式。正如我们稍后将看到的,poll
函数将收集网络数据并将其发送到栈中进行进一步处理。Exiting the loop when limits are reached(达到限制时退出循环)
The
net_rx_action
loop will exit when either:net_rx_action
循环将在以下情况之一退出:- The poll list registered for this CPU has no more NAPI structures (
!list_empty(&sd->poll_list)
), or - 为该CPU注册的轮询列表中没有更多NAPI结构(
!list_empty(&sd->poll_list)
)。
- The remaining budget is <= 0, or
- 剩余预算
<= 0
。
- The time limit of 2 jiffies has been reached
- 达到 2 个 jiffies 的时间限制。
Here’s this code we saw earlier again:
这是我们之前看到的代码:
/* If softirq window is exhausted then punt. * Allow this to run for 2 jiffies since which will allow * an average latency of 1.5/HZ. */ if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit))) goto softnet_break;
If you follow the
softnet_break
label you stumble upon something interesting. From net/core/dev.c:如果跟随
softnet_break
标签,你会发现一些有趣的事情。在net/core/dev.c
中:softnet_break: sd->time_squeeze++; __raise_softirq_irqoff(NET_RX_SOFTIRQ); goto out;
The struct softnet_data structure has some statistics incremented and the NET_RX_SOFTIRQ softirq is raised again so that the remaining work will be handled on a later softirq pass rather than right now; net_rx_action then returns. The time_squeeze field is a measure of the number of times net_rx_action had more work to do but either the budget was exhausted or the time limit was reached before it could be completed. This is a tremendously useful counter for understanding bottlenecks in network processing. We’ll see shortly how to monitor this value. Exiting here instead of continuing frees up processing time for other tasks, which makes sense as this small stub of code is only executed when more work could have been done, but we don’t want to monopolize the CPU.
struct softnet_data 结构的一些统计信息会增加,并且 NET_RX_SOFTIRQ 软中断会被再次触发,以便剩余的工作在稍后的软中断处理中继续,而不是立即执行;随后 net_rx_action 返回。time_squeeze 字段用于衡量 net_rx_action 有更多工作要做,但在完成之前预算耗尽或达到时间限制的次数。这是一个非常有用的计数器,用于了解网络处理中的瓶颈。我们稍后将看到如何监控这个值。在这里退出而不是继续处理,可以为其他任务释放处理时间。这是有意义的,因为只有在还有更多工作要做,但我们又不想独占 CPU 的情况下,才会执行这个小代码段。
out
label. Execution can also make it to the out
label if there were no more NAPI structures to process, in other words, there is more budget than there is network activity and all the drivers have shut NAPI off and there is nothing left for net_rx_action
to do.然后执行转移到
out
标签。如果没有更多的 NAPI 结构要处理,即预算比网络活动多,并且所有驱动程序都已关闭 NAPI,net_rx_action
没有其他事情可做,执行也会到达out
标签。The
out
section does one important thing before returning from net_rx_action
: it calls net_rps_action_and_irq_enable
. This function serves an important purpose if Receive Packet Steering is enabled; it wakes up remote CPUs to start processing network data.out
部分在从net_rx_action
返回之前做了一件重要的事情:它调用net_rps_action_and_irq_enable
。如果启用了接收数据包导向(Receive Packet Steering),这个函数将唤醒远程 CPU 以开始处理网络数据。We’ll see more about how RPS works later. For now, let’s see how to monitor the health of the
net_rx_action
processing loop and move on to the inner working of NAPI poll
functions so we can progress up the network stack.NAPI poll
Recall in previous sections that device drivers allocate a region of memory for the device to perform DMA to incoming packets. Just as it is the responsibility of the driver to allocate those regions, it is also the responsibility of the driver to unmap those regions, harvest the data, and send it up the network stack.
回想一下前面的章节,设备驱动程序为设备分配一块内存区域,以便设备对传入的数据包执行 DMA 操作。正如分配这些区域是驱动程序的责任一样,取消映射这些区域、收集数据并将其发送到网络栈也是驱动程序的责任。
Let’s take a look at how the
igb
driver does this to get an idea of how this works in practice.让我们看看
igb
驱动程序是如何做到这一点的,以便了解实际情况。igb_poll
At long last, we can finally examine our friend
igb_poll
. It turns out the code for igb_poll
is deceptively simple. Let’s take a look. From drivers/net/ethernet/intel/igb/igb_main.c:终于,我们可以研究一下
igb_poll
函数了。事实证明,igb_poll
的代码看似简单,实则不然。让我们来看看。在drivers/net/ethernet/intel/igb/igb_main.c
中:/** * igb_poll - NAPI Rx polling callback, NAPI Rx轮询回调函数 * @napi: napi polling structure, napi轮询结构 * @budget: count of how many packets we should handle, 我们应该处理的数据包数量 **/ static int igb_poll(struct napi_struct *napi, int budget) { struct igb_q_vector *q_vector = container_of(napi, struct igb_q_vector, napi); bool clean_complete = true; #ifdef CONFIG_IGB_DCA if (q_vector->adapter->flags & IGB_FLAG_DCA_ENABLED) igb_update_dca(q_vector); #endif /* ... */ if (q_vector->rx.ring) clean_complete &= igb_clean_rx_irq(q_vector, budget); /* If all work not completed, return budget and keep polling */ if (!clean_complete) return budget; /* If not enough Rx work done, exit the polling mode */ napi_complete(napi); igb_ring_irq_enable(q_vector); return 0; }
This code does a few interesting things:
这段代码做了几件有趣的事情:
- If Direct Cache Access (DCA) support is enabled in the kernel, the CPU cache is warmed so that accesses to the RX ring will hit CPU cache. You can read more about DCA in the Extras section at the end of this blog post.
- 如果内核中启用了直接缓存访问(DCA)支持,CPU 缓存将被预热,以便对 RX 环的访问能够命中 CPU 缓存。你可以在本文末尾的 “其他内容” 部分中阅读更多关于 DCA 的信息。
- Next,
igb_clean_rx_irq
is called which does the heavy lifting, as we’ll see next.
- 接下来,调用
igb_clean_rx_irq
函数,它将完成主要的工作,我们接下来会看到。
- Next,
clean_complete
is checked to determine if there was still more work that could have been done. If so, thebudget
(remember, this was hardcoded to64
) is returned. As we saw earlier,net_rx_action
will move this NAPI structure to the end of the poll list.
- 然后,检查
clean_complete
以确定是否还有更多工作可以做。如果是这样,返回budget
(请记住,在上面的代码中,它被硬编码为 64)。正如我们之前看到的,net_rx_action
将把这个 NAPI 结构移动到轮询列表的末尾。
- Otherwise, the driver turns off NAPI by calling
napi_complete
and re-enables interrupts by callingigb_ring_irq_enable
. The next interrupt that arrives will re-enable NAPI.
- 否则,驱动程序通过调用
napi_complete
关闭 NAPI,并通过调用igb_ring_irq_enable
重新启用中断。下一个到达的中断将重新启用 NAPI。
Let’s see how
igb_clean_rx_irq
sends network data up the stack.让我们看看
igb_clean_rx_irq
是如何将网络数据发送到栈中的。igb_clean_rx_irq
The
igb_clean_rx_irq
function is a loop which processes one packet at a time until the budget
is reached or no additional data is left to process.igb_clean_rx_irq
函数是一个循环,它一次处理一个数据包,直到达到budget
限制或没有更多数据可处理。The loop in this function does a few important things:
这个函数中的循环做了几件重要的事情:
- Allocates additional buffers for receiving data as used buffers are cleaned out. Additional buffers are added
IGB_RX_BUFFER_WRITE
(16) at a time. - 当清理已使用的缓冲区时,为接收数据分配额外的缓冲区。每次以
IGB_RX_BUFFER_WRITE
(16)为单位添加额外的缓冲区。
- Fetch a buffer from the RX queue and store it in an
skb
structure. - 从 RX 队列中获取一个缓冲区,并将其存储在一个
skb
结构中。
- Check if the buffer is an “End of Packet” buffer. If so, continue processing. Otherwise, continue fetching additional buffers from the RX queue, adding them to the
skb
. This is necessary if a received data frame is larger than the buffer size. - 检查该缓冲区是否为 “数据包结束” 缓冲区。如果是,则继续处理;否则,继续从 RX 队列中获取更多缓冲区,并将它们添加到
skb
中。如果接收到的数据帧大于缓冲区大小,这是必要的操作。
- Verify that the layout and headers of the data are correct.
- 验证数据的布局和头部是否正确。
- The number of bytes processed statistic counter is increased by
skb->len
. - 将处理的字节数统计计数器增加
skb->len
。
- Set the hash, checksum, timestamp, VLAN id, and protocol fields of the skb. The hash, checksum, timestamp, and VLAN id are provided by the hardware. If the hardware is signaling a checksum error, the
csum_error
statistic is incremented. If the checksum succeeded and the data is UDP or TCP data, theskb
is marked asCHECKSUM_UNNECESSARY
. If the checksum failed, the protocol stacks are left to deal with this packet. The protocol is computed with a call toeth_type_trans
and stored in theskb
struct. - 设置
skb
的哈希、校验和、时间戳、VLAN ID 和协议字段。哈希、校验和、时间戳和 VLAN ID 由硬件提供。如果硬件检测到校验和错误,csum_error
统计信息将增加。如果校验和成功,并且数据是 UDP 或 TCP 数据,则将skb
标记为CHECKSUM_UNNECESSARY
。如果校验和失败,协议栈将负责处理这个数据包。通过调用eth_type_trans
计算协议,并将其存储在skb
结构中。
- The constructed
skb
is handed up the network stack with a call tonapi_gro_receive
. - 通过调用
napi_gro_receive
将构造好的skb
传递到网络栈中。
- The number of packets processed statistics counter is incremented.
- 增加处理的数据包数量统计计数器。
- The loop continues until the number of packets processed reaches the budget.
- 循环继续,直到处理的数据包数量达到预算。
Once the loop terminates, the function assigns statistics counters for rx packets and bytes processed.
一旦循环终止,该函数会为接收的数据包和处理的字节数分配统计计数器。
Now it’s time to take two detours prior to proceeding up the network stack. First, let’s see how to monitor and tune the network subsystem’s softirqs. Next, let’s talk about Generic Receive Offloading (GRO). After that, the rest of the networking stack will make more sense as we enter
napi_gro_receive
.现在,在继续向上研究网络栈之前,我们需要进行两个小插曲。首先,让我们看看如何监控和调整网络子系统的软中断。接下来,让我们讨论一下通用接收卸载(Generic Receive Offloading,GRO)。之后,当我们进入
napi_gro_receive
时,网络栈的其余部分会更容易理解。Monitoring network data processing(监控网络数据处理)
/proc/net/softnet_stat
As seen in the previous section,
net_rx_action
increments a statistic when exiting the net_rx_action
loop and when additional work could have been done, but either the budget
or the time limit for the softirq was hit. This statistic is tracked as part of the struct softnet_data
associated with the CPU.如前一节所述,当
net_rx_action
退出循环,并且还有更多工作可以做,但budget
或软中断的时间限制已达到时,它会增加一个统计信息。这个统计信息作为与 CPU 相关联的struct softnet_data
的一部分进行跟踪。These statistics are output to a file in proc:
/proc/net/softnet_stat
for which there is, unfortunately, very little documentation. The fields in the file in proc are not labeled and could change between kernel releases.这些统计信息输出到
proc
文件系统中的一个文件:/proc/net/softnet_stat
,遗憾的是,关于这个文件的文档很少。proc
文件中的字段没有标记,并且可能在不同的内核版本之间发生变化。In Linux 3.13.0, you can find which values map to which field in
/proc/net/softnet_stat
by reading the kernel source. From net/core/net-procfs.c:在 Linux 3.13.0 中,你可以通过阅读内核源代码来确定
/proc/net/softnet_stat
中哪些值对应哪些字段。在net/core/net-procfs.c
中:seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n", sd->processed, sd->dropped, sd->time_squeeze, 0, 0, 0, 0, 0, /* was fastroute */ sd->cpu_collision, sd->received_rps, flow_limit_count);
Many of these statistics have confusing names and are incremented in places where you might not expect. An explanation of when and where each of these is incremented will be provided as the network stack is examined. Since the
time_squeeze
statistic was seen in net_rx_action
, I thought it made sense to document this file now.这些统计信息中的许多名称都容易引起混淆,并且在你可能想不到的地方增加。在研究网络栈时,将提供每个统计信息何时何地增加的解释。由于在
net_rx_action
中看到了 time_squeeze
统计信息,我认为现在记录这个文件是有意义的。Monitor network data processing statistics by reading
/proc/net/softnet_stat
.通过读取
/proc/net/softnet_stat
监控网络数据处理统计信息:$ cat /proc/net/softnet_stat 6dcad223 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6f0e1565 00000000 00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000 660774ec 00000000 00000003 00000000 00000000 00000000 00000000 00000000 00000000 00000000 61c99331 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6794b1b3 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6488cb92 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Important details about
/proc/net/softnet_stat
:关于
/proc/net/softnet_stat
的重要细节:- Each line of
/proc/net/softnet_stat
corresponds to astruct softnet_data
structure, of which there is 1 per CPU.
/proc/net/softnet_stat
的每一行对应一个struct softnet_data
结构,每个 CPU 有一个这样的结构。
- The values are separated by a single space and are displayed in hexadecimal
- 值之间用单个空格分隔,并以十六进制显示。
- The first value,
sd->processed
, is the number of network frames processed. This can be more than the total number of network frames received if you are using ethernet bonding. There are cases where the ethernet bonding driver will trigger network data to be re-processed, which would increment thesd->processed
count more than once for the same packet.
- 第一个值
sd->processed
是处理的网络帧数。如果你使用以太网绑定,这个值可能会大于接收的网络帧总数。在某些情况下,以太网绑定驱动程序会触发网络数据重新处理,这会使sd->processed
计数对同一个数据包增加多次。
- The second value,
sd->dropped
, is the number of network frames dropped because there was no room on the processing queue. More on this later.
- 第二个值
sd->dropped
是由于处理队列中没有空间而丢弃的网络帧数。稍后会详细介绍。
- The third value,
sd->time_squeeze
, is (as we saw) the number of times thenet_rx_action
loop terminated because the budget was consumed or the time limit was reached, but more work could have been. Increasing thebudget
as explained earlier can help reduce this.
- 第三个值
sd->time_squeeze
(如我们所见)是net_rx_action
循环因预算耗尽或达到时间限制而终止,但还有更多工作可做的次数。如前所述,增加budget
可以帮助减少这个值。
- The next 5 values are always 0.
- 接下来的 5 个值始终为 0。
- The ninth value,
sd->cpu_collision
, is a count of the number of times a collision occurred when trying to obtain a device lock when transmitting packets. This article is about receive, so this statistic will not be seen below.
- 第九个值
sd->cpu_collision
是在传输数据包时尝试获取设备锁时发生冲突的次数。本文是关于接收的,所以下面不会看到这个统计信息。
- The tenth value,
sd->received_rps
, is a count of the number of times this CPU has been woken up to process packets via an Inter-processor Interrupt
- 第十个值
sd->received_rps
是这个 CPU 通过处理器间中断被唤醒以处理数据包的次数。
- The last value,
flow_limit_count
, is a count of the number of times the flow limit has been reached. Flow limiting is an optional Receive Packet Steering feature that will be examined shortly.
- 最后一个值
flow_limit_count
是达到流量限制的次数。流量限制是接收数据包导向的一个可选功能,稍后将进行研究。
If you decide to monitor this file and graph the results, you must be extremely careful that the ordering of these fields hasn’t changed and that the meaning of each field has been preserved. You will need to read the kernel source to verify this.
如果你决定监控这个文件并绘制结果图表,必须非常小心,确保这些字段的顺序没有改变,并且每个字段的含义保持不变。你需要阅读内核源代码来验证这一点。
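Since the values are hexadecimal and unlabeled, a small one-liner can make them easier to read. A minimal sketch for the 3.13.0 field layout described above; it assumes GNU awk (gawk), which provides strtonum:
$ awk '{printf "cpu%d processed=%d dropped=%d time_squeeze=%d received_rps=%d\n", NR-1, strtonum("0x"$1), strtonum("0x"$2), strtonum("0x"$3), strtonum("0x"$10)}' /proc/net/softnet_stat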
Tuning network data processing(调整网络数据处理)
Adjusting the
net_rx_action
budget(调整 net_rx_action 预算)You can adjust the
net_rx_action
budget, which determines how much packet processing can be spent among all NAPI structures registered to a CPU by setting a sysctl value named net.core.netdev_budget
.你可以通过设置一个名为
net.core.netdev_budget
的 sysctl 值来调整net_rx_action
预算,该预算决定了在注册到一个 CPU 的所有 NAPI 结构之间可以花费多少数据包处理资源。Example: set the overall packet processing budget to 600.
示例:将整体数据包处理预算设置为 600
$ sudo sysctl -w net.core.netdev_budget=600
You may also want to write this setting to your
/etc/sysctl.conf
file so that changes persist between reboots.你可能还想将这个设置写入
/etc/sysctl.conf
文件,以便更改在重启后仍然有效。The default value on Linux 3.13.0 is 300.
在 Linux 3.13.0 中,默认值是 300。
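A minimal sketch of persisting the change across reboots, assuming your distribution reads /etc/sysctl.conf at boot:
$ echo 'net.core.netdev_budget=600' | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p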
Generic Receive Offloading (GRO)(通用接收卸载(GRO))
Generic Receive Offloading (GRO) is a software implementation of a hardware optimization that is known as Large Receive Offloading (LRO).
通用接收卸载(Generic Receive Offloading,GRO)是一种硬件优化(称为大接收卸载,Large Receive Offloading,LRO)的软件实现。
The main idea behind both methods is that reducing the number of packets passed up the network stack by combining “similar enough” packets together can reduce CPU usage. For example, imagine a case where a large file transfer is occurring and most of the packets contain chunks of data in the file. Instead of sending small packets up the stack one at a time, the incoming packets can be combined into one packet with a huge payload. That packet can then be passed up the stack. This allows the protocol layers to process a single packet’s headers while delivering bigger chunks of data to the user program.
这两种方法的主要思想是,通过将 “足够相似” 的数据包合并在一起,减少传递到网络栈的数据包数量,从而降低 CPU 使用率。例如,想象一个大文件传输的场景,大多数数据包包含文件中的数据块。与其一次将小数据包逐个发送到栈中,不如将传入的数据包合并成一个带有巨大有效负载的数据包,然后将这个数据包传递到栈中。这使得协议层可以处理单个数据包的头部,同时将更大的数据块传递给用户程序。
The problem with this sort of optimization is, of course, information loss. If a packet had some important option or flag set, that option or flag could be lost if the packet is coalesced into another. And this is exactly why most people don’t use or encourage the use of LRO. LRO implementations, generally speaking, had very lax rules for coalescing packets.
当然,这种优化的问题在于信息丢失。如果一个数据包设置了一些重要的选项或标志,当它被合并到另一个数据包中时,这些选项或标志可能会丢失。这正是为什么大多数人不使用或不鼓励使用 LRO 的原因。一般来说,LRO 实现对于合并数据包的规则非常宽松。
GRO was introduced as an implementation of LRO in software, but with more strict rules around which packets can be coalesced.
GRO 作为 LRO 的软件实现被引入,但对于哪些数据包可以合并有更严格的规则。
By the way: if you have ever used
tcpdump
and seen unrealistically large incoming packet sizes, it is most likely because your system has GRO enabled. As you’ll see soon, packet capture taps are inserted further up the stack, after GRO has already happened.顺便说一下:如果你曾经使用过
tcpdump
,并且看到不切实际的大传入数据包大小,很可能是因为你的系统启用了 GRO。正如你很快就会看到的,数据包捕获点是在网络栈中更靠上的位置插入的,在 GRO 已经发生之后。Tuning: Adjusting GRO settings with ethtool
(调整:使用 ethtool 调整 GRO 设置)
You can use
ethtool
to check if GRO is enabled and also to adjust the setting.你可以使用
ethtool
检查 GRO 是否启用,也可以调整这个设置。Use
ethtool -k
to check your GRO settings.使用
ethtool -k
检查你的 GRO 设置:$ ethtool -k eth0 | grep generic-receive-offload generic-receive-offload: on
As you can see, on this system I have
generic-receive-offload
set to on.如你所见,在这个系统上,我将
generic-receive-offload
设置为开启。Use
ethtool -K
to enable (or disable) GRO.使用
ethtool -K
启用(或禁用)GRO:$ sudo ethtool -K eth0 gro on
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
注意:对于大多数驱动程序,进行这些更改会使网络接口先关闭再重新启动,与该接口的连接将被中断。不过,对于一次性更改而言,这可能影响不大。
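Since GRO and LRO are easy to confuse, it can be worth checking both feature flags at once; this assumes an interface named eth0:
$ ethtool -k eth0 | grep -E '(generic|large)-receive-offload'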
napi_gro_receive
The function
napi_gro_receive
deals with processing network data for GRO (if GRO is enabled for the system) and sending the data up the stack toward the protocol layers. Much of this logic is handled in a function called
.napi_gro_receive
函数负责处理网络数据的 GRO(如果系统启用了 GRO),并将数据发送到栈中,朝着协议层传递。大部分逻辑在一个名为dev_gro_receive
的函数中处理。dev_gro_receive
This function begins by checking if GRO is enabled and, if so, preparing to do GRO. In the case where GRO is enabled, a list of GRO offload filters is traversed to allow the higher level protocol stacks to act on a piece of data which is being considered for GRO. This is done so that the protocol layers can let the network device layer know if this packet is part of a network flow that is currently being receive offloaded and handle anything protocol specific that should happen for GRO. For example, the TCP protocol will need to decide if/when to ACK a packet that is being coalesced into an existing packet.
这个函数首先检查 GRO 是否启用,如果启用,则准备进行 GRO 操作。在 GRO 启用的情况下,会遍历一组 GRO 卸载过滤器,以便更高层的协议栈可以对正在考虑进行 GRO 的数据进行操作。这样做是为了让协议层能够告知网络设备层这个数据包是否属于当前正在进行接收卸载的网络流,并处理 GRO 所需的任何特定于协议的操作。例如,TCP 协议需要决定是否以及何时对正在合并到现有数据包中的数据包进行 ACK 响应。
Here’s the code from
net/core/dev.c
which does this:在
net/core/dev.c
中的代码如下:list_for_each_entry_rcu(ptype, head, list) { if (ptype->type != type || !ptype->callbacks.gro_receive) continue; skb_set_network_header(skb, skb_gro_offset(skb)); skb_reset_mac_len(skb); NAPI_GRO_CB(skb)->same_flow = 0; NAPI_GRO_CB(skb)->flush = 0; NAPI_GRO_CB(skb)->free = 0; pp = ptype->callbacks.gro_receive(&napi->gro_list, skb); break; }
If the protocol layers indicated that it is time to flush the GRO’d packet, that is taken care of next. This happens with a call to
napi_gro_complete
, which calls a gro_complete
callback for the protocol layers and then passes the packet up the stack by calling netif_receive_skb
.Here’s the code from
net/core/dev.c
which does this:if (pp) { struct sk_buff *nskb = *pp; *pp = nskb->next; nskb->next = NULL; napi_gro_complete(nskb); napi->gro_count--; }
Next, if the protocol layers merged this packet to an existing flow,
napi_gro_receive
simply returns as there’s nothing else to do.If the packet was not merged and there are fewer than
MAX_GRO_SKBS
(8) GRO flows on the system, a new entry is added to the gro_list
on the NAPI structure for this CPU.Here’s the code from
net/core/dev.c
which does this:if (NAPI_GRO_CB(skb)->flush || napi->gro_count >= MAX_GRO_SKBS) goto normal; napi->gro_count++; NAPI_GRO_CB(skb)->count = 1; NAPI_GRO_CB(skb)->age = jiffies; skb_shinfo(skb)->gso_size = skb_gro_len(skb); skb->next = napi->gro_list; napi->gro_list = skb; ret = GRO_HELD;
And that is how the GRO system in the Linux networking stack works.
napi_skb_finish
Once
dev_gro_receive
completes, napi_skb_finish
is called which either frees unneeded data structures because a packet has been merged, or calls netif_receive_skb
to pass the data up the network stack (because there were already MAX_GRO_SKBS
flows being GRO’d).Next, it’s time for
netif_receive_skb
to see how data is handed off to the protocol layers. Before this can be examined, we’ll need to take a look at Receive Packet Steering (RPS) first.Receive Packet Steering (RPS)(接收数据包导向(RPS))
Recall earlier how we discussed that network device drivers register a NAPI
poll
function. Each NAPI
poller instance is executed in the context of a softirq of which there is one per CPU. Further recall that the CPU which the driver’s IRQ handler runs on will wake its softirq processing loop to process packets.In other words: a single CPU processes the hardware interrupt and polls for packets to process incoming data.
Some NICs (like the Intel I350) support multiple queues at the hardware level. This means incoming packets can be DMA’d to a separate memory region for each queue, with a separate NAPI structure to manage polling this region, as well. Thus multiple CPUs will handle interrupts from the device and also process packets.
This feature is typically called Receive Side Scaling (RSS).
Receive Packet Steering (RPS) is a software implementation of RSS. Since it is implemented in software, this means it can be enabled for any NIC, even NICs which have only a single RX queue. However, since it is in software, this means that RPS can only enter into the flow after a packet has been harvested from the DMA memory region.
This means that you wouldn’t notice a decrease in CPU time spent handling IRQs or the NAPI
poll
loop, but you can distribute the load for processing the packet after it’s been harvested and reduce CPU time from there up the network stack.RPS works by generating a hash for incoming data to determine which CPU should process the data. The data is then enqueued to the per-CPU receive network backlog to be processed. An Inter-processor Interrupt (IPI) is delivered to the CPU owning the backlog. This helps to kick-start backlog processing if it is not currently processing data on the backlog. The
/proc/net/softnet_stat
contains a count of the number of times each softnet_data
struct has received an IPI (the received_rps
field).Thus,
netif_receive_skb
will either continue sending network data up the networking stack, or hand it over to RPS for processing on a different CPU.Tuning: Enabling RPS
For RPS to work, it must be enabled in the kernel configuration (it is on Ubuntu for kernel 3.13.0), and a bitmask describing which CPUs should process packets for a given interface and RX queue must be configured.
You can find some documentation about these bitmasks in the kernel documentation.
In short, the bitmasks to modify are found in:
/sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus
So, for
eth0
and receive queue 0, you would modify the file: /sys/class/net/eth0/queues/rx-0/rps_cpus
with a hexadecimal number indicating which CPUs should process packets from eth0
’s receive queue 0. As the documentation points out, RPS may be unnecessary in certain configurations.Note: enabling RPS to distribute packet processing to CPUs which were previously not processing packets will cause the number of `NET_RX` softirqs to increase for that CPU, as well as the `si` or `sitime` in the CPU usage graph. You can compare before and after of your softirq and CPU usage graphs to confirm that RPS is configured properly to your liking.
Receive Flow Steering (RFS)
Receive flow steering (RFS) is used in conjunction with RPS. RPS attempts to distribute incoming packet load amongst multiple CPUs, but does not take into account any data locality issues for maximizing CPU cache hit rates. You can use RFS to help increase cache hit rates by directing packets for the same flow to the same CPU for processing.
Tuning: Enabling RFS
For RFS to work, you must have RPS enabled and configured.
RFS keeps track of a global hash table of all flows and the size of this hash table can be adjusted by setting the
net.core.rps_sock_flow_entries
sysctl.Increase the size of the RFS socket flow hash by setting a
sysctl
.$ sudo sysctl -w net.core.rps_sock_flow_entries=32768
Next, you can also set the number of flows per RX queue by writing this value to the sysfs file named
rps_flow_cnt
for each RX queue.Example: increase the number of flows for RX queue 0 on eth0 to 2048.
$ sudo bash -c 'echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt'
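If your NIC exposes several RX queues, you will usually want to set rps_flow_cnt on each of them. A minimal sketch of a loop over sysfs, assuming the interface is named eth0:
$ for f in /sys/class/net/eth0/queues/rx-*/rps_flow_cnt; do echo 2048 | sudo tee "$f"; done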
Hardware accelerated Receive Flow Steering (aRFS)
RFS can be sped up with the use of hardware acceleration; the NIC and the kernel can work together to determine which flows should be processed on which CPUs. To use this feature, it must be supported by the NIC and your driver.
Consult your NIC’s data sheet to determine if this feature is supported. If your NIC’s driver exposes a function called
ndo_rx_flow_steer
, then the driver has support for accelerated RFS.Tuning: Enabling accelerated RFS (aRFS)
Assuming that your NIC and driver support it, you can enable accelerated RFS by enabling and configuring a set of things:
- Have RPS enabled and configured.
- Have RFS enabled and configured.
- Your kernel has
CONFIG_RFS_ACCEL
enabled at compile time. The Ubuntu kernel 3.13.0 does.
- Have ntuple support enabled for the device, as described previously. You can use
ethtool
to verify that ntuple support is enabled for the device.
- Configure your IRQ settings to ensure each RX queue is handled by one of your desired network processing CPUs.
Once the above is configured, accelerated RFS will be used to automatically move data to the RX queue tied to a CPU core that is processing data for that flow and you won’t need to specify an ntuple filter rule manually for each flow.
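A quick way to check and enable the ntuple prerequisite from the list above, assuming an interface named eth0 (the feature shows up as ntuple-filters in ethtool's feature list):
$ ethtool -k eth0 | grep ntuple
$ sudo ethtool -K eth0 ntuple on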
Moving up the network stack with netif_receive_skb
Let’s pick up where we left off with
netif_receive_skb
, which is called from a few places. The two most common (and also the two we’ve already looked at):napi_skb_finish
if the packet is not going to be merged to an existing GRO’d flow, OR
napi_gro_complete
if the protocol layers indicated that it’s time to flush the flow.
Reminder:
netif_receive_skb
and its descendants are operating in the context of the softirq processing loop and you'll see the time spent here accounted for as sitime
or si
with tools like top
.netif_receive_skb
begins by first checking a sysctl
value to determine if the user has requested receive timestamping before or after a packet hits the backlog queue. If this setting is enabled, the data is timestamped now, prior to it hitting RPS (and the CPU’s associated backlog queue). If this setting is disabled, it will be timestamped after it hits the queue. This can be used to distribute the load of timestamping amongst multiple CPUs if RPS is enabled, but will introduce some delay as a result.
Tuning: RX packet timestamping
You can tune when packets will be timestamped after they are received by adjusting a sysctl named
net.core.netdev_tstamp_prequeue
:Disable timestamping for RX packets by adjusting a
sysctl
$ sudo sysctl -w net.core.netdev_tstamp_prequeue=0
The default value is 1. Please see the previous section for an explanation as to what this setting means, exactly.
netif_receive_skb
After the timestamping is dealt with,
netif_receive_skb
operates differently depending on whether or not RPS is enabled. Let’s start with the simpler path first: RPS disabled.Without RPS (default setting)
If RPS is not enabled,
__netif_receive_skb
is called which does some bookkeeping and then calls __netif_receive_skb_core
to move data closer to the protocol stacks.We’ll see precisely how
__netif_receive_skb_core
works, but first let’s see how the RPS enabled code path works, as that code will also call __netif_receive_skb_core
.With RPS enabled
If RPS is enabled, after the timestamping options mentioned above are dealt with,
netif_receive_skb
will perform some computations to determine which CPU’s backlog queue should be used. This is done by using the function get_rps_cpu
. From net/core/dev.c:cpu = get_rps_cpu(skb->dev, skb, &rflow); if (cpu >= 0) { ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail); rcu_read_unlock(); return ret; }
get_rps_cpu
will take into account RFS and aRFS settings as described above to ensure the data gets queued to the desired CPU’s backlog with a call to
.enqueue_to_backlog
This function begins by getting a pointer to the remote CPU’s
softnet_data
structure, which contains a pointer to the input_pkt_queue
. Next, the queue length of the input_pkt_queue
of the remote CPU is checked. From net/core/dev.c:qlen = skb_queue_len(&sd->input_pkt_queue); if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
The length of
input_pkt_queue
is first compared to netdev_max_backlog
. If the queue is longer than this value, the data is dropped. Similarly, the flow limit is checked and if it has been exceeded, the data is dropped. In both cases the drop count on the softnet_data
structure is incremented. Note that this is the softnet_data
structure of the CPU the data was going to be queued to. Read the section above about /proc/net/softnet_stat
to learn how to get the drop count for monitoring purposes.enqueue_to_backlog
is not called in many places. It is called for RPS-enabled packet processing and also from netif_rx
. Most drivers should not be using netif_rx
and should instead be using netif_receive_skb
. If you are not using RPS and your driver is not using netif_rx
, increasing the backlog won’t produce any noticeable effect on your system as it is not used.Note: You need to check the driver you are using. If it calls
netif_receive_skb
and you are not using RPS, increasing the netdev_max_backlog
will not yield any performance improvement because no data will ever make it to the input_pkt_queue
.Assuming that the
input_pkt_queue
is small enough and the flow limit (more about this, next) hasn’t been reached (or is disabled), the data can be queued. The logic here is a bit funny, but can be summarized as:- If the queue is empty: check if NAPI has been started on the remote CPU. If not, check if an IPI is queued to be sent. If not, queue one and start the NAPI processing loop by calling
____napi_schedule
. Proceed to queuing the data.
- If the queue is not empty, or the previously described operation has completed, enqueue the data.
The code is a bit tricky with its use of
goto
, so read it carefully. From net/core/dev.c:if (skb_queue_len(&sd->input_pkt_queue)) { enqueue: __skb_queue_tail(&sd->input_pkt_queue, skb); input_queue_tail_incr_save(sd, qtail); rps_unlock(sd); local_irq_restore(flags); return NET_RX_SUCCESS; } /* Schedule NAPI for backlog device * We can use non atomic operation since we own the queue lock */ if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) { if (!rps_ipi_queued(sd)) ____napi_schedule(sd, &sd->backlog); } goto enqueue;
Flow limits
RPS distributes packet processing load amongst multiple CPUs, but a single large flow can monopolize CPU processing time and starve smaller flows. Flow limits are a feature that can be used to limit the number of packets queued to the backlog for each flow to a certain amount. This can help ensure that smaller flows are processed even though much larger flows are pushing packets in.
The if statement above from net/core/dev.c checks the flow limit with a call to
skb_flow_limit
:if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
This code is checking that there is still room in the queue and that the flow limit has not been reached. By default, flow limits are disabled. In order to enable flow limits, you must specify a bitmap (similar to RPS’ bitmap).
Monitoring: Monitor drops due to full input_pkt_queue
or flow limit
See the section above about monitoring
/proc/net/softnet_stat
. The dropped
field is a counter that gets incremented each time data is dropped instead of queued to a CPU’s input_pkt_queue
.Tuning
Tuning: Adjusting
netdev_max_backlog
to prevent dropsBefore adjusting this tuning value, see the note in the previous section.
You can help prevent drops in
enqueue_to_backlog
by increasing the netdev_max_backlog
if you are using RPS or if your driver calls netif_rx
.Example: increase backlog to 3000 with
sysctl
.$ sudo sysctl -w net.core.netdev_max_backlog=3000
The default value is 1000.
Tuning: Adjust the NAPI weight of the backlog
poll
loopYou can adjust the weight of the backlog’s NAPI poller by setting the
net.core.dev_weight
sysctl. Adjusting this value determines how much of the overall budget the backlog poll
loop can consume (see the section above about adjusting net.core.netdev_budget
):Example: increase the NAPI
poll
backlog processing loop with sysctl
.$ sudo sysctl -w net.core.dev_weight=600
The default value is 64.
Remember, backlog processing runs in the softirq context similar to the device driver’s registered
poll
function and will be limited by the overall budget
and a time limit, as described in previous sections.Tuning: Enabling flow limits and tuning flow limit hash table size
Set the size of the flow limit table with a
sysctl
.$ sudo sysctl -w net.core.flow_limit_table_len=8192
The default value is 4096.
This change only affects newly allocated flow hash tables. So, if you’d like to increase the table size, you should do it before you enable flow limits.
To enable flow limits you should specify a bitmask in
/proc/sys/net/core/flow_limit_cpu_bitmap
similar to the RPS bitmask which indicates which CPUs have flow limits enabled.backlog queue NAPI poller
The per-CPU backlog queue plugs into NAPI the same way a device driver does. A
poll
function is provided that is used to process packets from the softirq context. A weight
is also provided, just as a device driver would.This NAPI struct is provided during initialization of the networking system. From
net_dev_init
in net/core/dev.c
:sd->backlog.poll = process_backlog; sd->backlog.weight = weight_p; sd->backlog.gro_list = NULL; sd->backlog.gro_count = 0;
The backlog NAPI structure differs from the device driver NAPI structure in that the
weight
parameter is adjustable, whereas drivers hardcode their NAPI weight to 64. We’ll see in the tuning section below how to adjust the weight using a sysctl
.process_backlog
The
process_backlog
function is a loop which runs until its weight (as described in the previous section) has been consumed or no more data remains on the backlog.Each piece of data on the backlog queue is removed from the backlog queue and passed on to
__netif_receive_skb
. The code path once the data hits __netif_receive_skb
is the same as explained above for the RPS disabled case. Namely, __netif_receive_skb
does some bookkeeping prior to calling __netif_receive_skb_core
to pass network data up to the protocol layers.process_backlog
follows the same contract with NAPI that device drivers do, which is: NAPI is disabled if the total weight will not be used. The poller is restarted with the call to ____napi_schedule
from enqueue_to_backlog
as described above.The function returns the amount of work done, which
net_rx_action
(described above) will subtract from the budget (which is adjusted with the net.core.netdev_budget
, as described above).__netif_receive_skb_core
delivers data to packet taps and protocol layers
__netif_receive_skb_core
performs the heavy lifting of delivering the data to protocol stacks. Before it does this, it checks if any packet taps have been installed which are catching all incoming packets. One example of something that does this is the AF_PACKET
address family, typically used via the libpcap library.If such a tap exists, the data is delivered there first then to the protocol layers next.
Packet tap delivery
If a packet tap is installed (usually via libpcap), the packet is delivered there with the following code from net/core/dev.c:
list_for_each_entry_rcu(ptype, &ptype_all, list) { if (!ptype->dev || ptype->dev == skb->dev) { if (pt_prev) ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = ptype; } }
If you are curious about the path the data takes through pcap, read net/packet/af_packet.c.
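To make the tap path concrete, here is a minimal sketch (mine, not from the kernel or libpcap) that installs a packet tap by opening an AF_PACKET socket directly, which is roughly what libpcap does underneath. Any frame processed by __netif_receive_skb_core on the bound interface is also delivered to this socket via the ptype_all list. Running it requires CAP_NET_RAW (typically root), and the interface name eth0 is an assumption.

/* tap_sketch.c: minimal AF_PACKET tap; the interface name "eth0" is an assumption. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>

int main(void)
{
    /* ETH_P_ALL asks for every protocol, mirroring the ptype_all tap list. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_ll sll = {0};
    sll.sll_family = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex = if_nametoindex("eth0");   /* assumed interface name */
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) { perror("bind"); return 1; }

    char frame[2048];
    for (int i = 0; i < 10; i++) {
        ssize_t len = recv(fd, frame, sizeof(frame), 0);
        if (len < 0) { perror("recv"); break; }
        printf("tapped a %zd byte frame\n", len);
    }
    close(fd);
    return 0;
}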
Protocol layer delivery
Once the taps have been satisfied,
__netif_receive_skb_core
delivers data to protocol layers. It does this by obtaining the protocol field from the data and iterating across a list of deliver functions registered for that protocol type.This can be seen in
__netif_receive_skb_core
in net/core/dev.c:type = skb->protocol; list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) { if (ptype->type == type && (ptype->dev == null_or_dev || ptype->dev == skb->dev || ptype->dev == orig_dev)) { if (pt_prev) ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = ptype; } }
The
ptype_base
identifier above is defined as a hash table of lists in net/core/dev.c:struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
Each protocol layer adds a filter to a list at a given slot in the hash table, computed with a helper function called
ptype_head
:static inline struct list_head *ptype_head(const struct packet_type *pt) { if (pt->type == htons(ETH_P_ALL)) return &ptype_all; else return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK]; }
Adding a filter to the list is accomplished with a call to
dev_add_pack
. That is how protocol layers register themselves for network data delivery for their protocol type.And now you know how network data gets from the NIC to the protocol layer.
Protocol layer registration
Now that we know how data is delivered to the protocol stacks from the network device subsystem, let’s see how a protocol layer registers itself.
This blog post is going to examine the IP protocol stack as it is a commonly used protocol and will be relevant to most readers.
IP protocol layer
The IP protocol layer plugs itself into the
ptype_base
hash table so that data will be delivered to it from the network device layer described in previous sections.This happens in the function
inet_init
from net/ipv4/af_inet.c:dev_add_pack(&ip_packet_type);
This registers the IP packet type structure defined at net/ipv4/af_inet.c:
static struct packet_type ip_packet_type __read_mostly = { .type = cpu_to_be16(ETH_P_IP), .func = ip_rcv, };
__netif_receive_skb_core
calls deliver_skb
(as seen in the above section), which calls func
(in this case, ip_rcv
).ip_rcv
The
ip_rcv
function is pretty straight-forward at a high level. There are several integrity checks to ensure the data is valid. Statistics counters are bumped, as well.ip_rcv
ends by passing the packet to ip_rcv_finish
by way of netfilter. This is done so that any iptables rules that should be matched at the IP protocol layer can take a look at the packet before it continues on.We can see the code which hands the data over to netfilter at the end of
ip_rcv
in net/ipv4/ip_input.c:return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish);
netfilter and iptables
In the interest of brevity (and my RSI), I’ve decided to skip my deep dive into netfilter, iptables, and conntrack.
The short version is that
NF_HOOK_THRESH
will check if any filters are installed and attempt to return execution back to the IP protocol layer to avoid going deeper into netfilter and anything that hooks in below that like iptables and conntrack.Keep in mind: if you have numerous or very complex netfilter or iptables rules, those rules will be executed in the softirq context and can lead to latency in your network stack. This may be unavoidable, though, if you need to have a particular set of rules installed.
ip_rcv_finish
Once netfilter has had a chance to take a look at the data and decide what to do with it,
ip_rcv_finish
is called. This only happens if the data is not being dropped by netfilter, of course.ip_rcv_finish
begins with an optimization. In order to deliver the packet to the proper place, a dst_entry
from the routing system needs to be in place. In order to obtain one, the code initially attempts to call the early_demux
function from the higher level protocol that this data is destined for.The
early_demux
routine is an optimization which attempts to find the dst_entry
needed to deliver the packet by checking if a dst_entry
is cached on the socket structure.Here’s what that looks like from net/ipv4/ip_input.c:
if (sysctl_ip_early_demux && !skb_dst(skb) && skb->sk == NULL) { const struct net_protocol *ipprot; int protocol = iph->protocol; ipprot = rcu_dereference(inet_protos[protocol]); if (ipprot && ipprot->early_demux) { ipprot->early_demux(skb); /* must reload iph, skb->head might have changed */ iph = ip_hdr(skb); } }
As you can see above, this code is guarded by a sysctl
sysctl_ip_early_demux
. By default early_demux
is enabled. The next section includes information about how to disable it and why you might want to.If the optimization is enabled and there is no cached entry (because this is the first packet arriving), the packet will be handed off to the routing system in the kernel where the
dst_entry
will be computed and assigned.Once the routing layer completes, statistics counters are updated and the function ends by calling
dst_input(skb)
which in turn calls the input function pointer on the packet’s dst_entry
structure that was affixed by the routing system.If the packet’s final destination is the local system, the routing system will attach the function
ip_local_deliver
to the input function pointer in the dst_entry
structure on the packet.Tuning: adjusting IP protocol early demux
Disable the
early_demux
optimization by setting a sysctl
.$ sudo sysctl -w net.ipv4.ip_early_demux=0
The default value is 1;
early_demux
is enabled.This sysctl was added as some users saw a ~5% decrease in throughput with the
early_demux
optimization in some cases.ip_local_deliver
Recall how we saw the following pattern in the IP protocol layer:
- Calls to
ip_rcv
do some initial bookkeeping.
- Packet is handed off to netfilter for processing, with a pointer to a callback to be executed when processing finishes.
ip_rcv_finish
is the callback which finishes processing and continues working toward pushing the packet up the networking stack.
ip_local_deliver
has the same pattern. From net/ipv4/ip_input.c:/* * Deliver IP Packets to the higher protocol layers. */ int ip_local_deliver(struct sk_buff *skb) { /* * Reassemble IP fragments. */ if (ip_is_fragment(ip_hdr(skb))) { if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER)) return 0; } return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL, ip_local_deliver_finish); }
Once netfilter has had a chance to take a look at the data,
ip_local_deliver_finish
will be called, assuming the data is not dropped first by netfilter.ip_local_deliver_finish
ip_local_deliver_finish
obtains the protocol from the packet, looks up a net_protocol
structure registered for that protocol, and calls the function pointed to by handler
in the net_protocol
structure.This hands the packet up to the higher level protocol layer.
Monitoring: IP protocol layer statistics
Monitor detailed IP protocol statistics by reading
/proc/net/snmp
.$ cat /proc/net/snmp Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates Ip: 1 64 25922988125 0 0 15771700 0 0 25898327616 22789396404 12987882 51 1 10129840 2196520 1 0 0 0 ...
This file contains statistics for several protocol layers. The IP protocol layer appears first. The first line contains space-separated names for each of the corresponding values in the next line.
In the IP protocol layer, you will find statistics counters being bumped. Those counters are referenced by a C enum. All of the valid enum values and the field names they correspond to in
/proc/net/snmp
can be found in include/uapi/linux/snmp.h:enum { IPSTATS_MIB_NUM = 0, /* frequently written fields in fast path, kept in same cache line */ IPSTATS_MIB_INPKTS, /* InReceives */ IPSTATS_MIB_INOCTETS, /* InOctets */ IPSTATS_MIB_INDELIVERS, /* InDelivers */ IPSTATS_MIB_OUTFORWDATAGRAMS, /* OutForwDatagrams */ IPSTATS_MIB_OUTPKTS, /* OutRequests */ IPSTATS_MIB_OUTOCTETS, /* OutOctets */ /* ... */
Monitor extended IP protocol statistics by reading
/proc/net/netstat
.$ cat /proc/net/netstat | grep IpExt IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT0Pktsu InCEPkts IpExt: 0 0 0 0 277959 0 14568040307695 32991309088496 0 0 58649349 0 0 0 0 0
The format is similar to
/proc/net/snmp
, except the lines are prefixed with IpExt
.Some interesting statistics:
InReceives
: The total number of IP packets that reachedip_rcv
before any data integrity checks.
InHdrErrors
: Total number of IP packets with corrupted headers. The header was too short, too long, non-existent, had the wrong IP protocol version number, etc.
InAddrErrors
: Total number of IP packets where the host was unreachable.
ForwDatagrams
: Total number of IP packets that have been forwarded.
InUnknownProtos
: Total number of IP packets with unknown or unsupported protocol specified in the header.
InDiscards
: Total number of IP packets discarded due to memory allocation failure or checksum failure when packets are trimmed.
InDelivers
: Total number of IP packets successfully delivered to higher protocol layers. Keep in mind that those protocol layers may drop data even if the IP layer does not.
InCsumErrors
: Total number of IP Packets with checksum errors.
Note that each of these is incremented in really specific locations in the IP layer. Code gets moved around from time to time and double counting errors or other accounting bugs can sneak in. If these statistics are important to you, you are strongly encouraged to read the IP protocol layer source code for the metrics that are important to you so you understand when they are (and are not) being incremented.
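If you would like to consume these counters programmatically, a simple reader that pairs the field-name line with the value line is a reasonable starting point. The sketch below is my own illustration, not from the post; it just prints the two Ip: rows so they can be diffed over time.

/* snmp_ip.c: print the IP rows of /proc/net/snmp; pair the header row with the
 * value row to track individual counters over time. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/net/snmp", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[1024];
    while (fgets(line, sizeof(line), f)) {
        /* Each protocol contributes two lines: one with field names, one with values. */
        if (strncmp(line, "Ip:", 3) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}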
Higher level protocol registration
This blog post will examine UDP, but the TCP protocol handler is registered the same way and at the same time as the UDP protocol handler.
In
net/ipv4/af_inet.c
, the structure definitions which contain the handler functions for connecting the UDP, TCP , and ICMP protocols to the IP protocol layer can be found. From net/ipv4/af_inet.c:static const struct net_protocol tcp_protocol = { .early_demux = tcp_v4_early_demux, .handler = tcp_v4_rcv, .err_handler = tcp_v4_err, .no_policy = 1, .netns_ok = 1, }; static const struct net_protocol udp_protocol = { .early_demux = udp_v4_early_demux, .handler = udp_rcv, .err_handler = udp_err, .no_policy = 1, .netns_ok = 1, }; static const struct net_protocol icmp_protocol = { .handler = icmp_rcv, .err_handler = icmp_err, .no_policy = 1, .netns_ok = 1, };
These structures are registered in the initialization code of the inet address family. From net/ipv4/af_inet.c:
/* * Add all the base protocols. */ if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0) pr_crit("%s: Cannot add ICMP protocol\n", __func__); if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0) pr_crit("%s: Cannot add UDP protocol\n", __func__); if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0) pr_crit("%s: Cannot add TCP protocol\n", __func__);
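To illustrate the same mechanism, here is a hedged sketch of a kernel module (not something the post or the kernel ships) that registers its own net_protocol handler using the struct fields shown above. The protocol number 253 (reserved for experimentation) and the demo_ names are assumptions; the API shown matches the kernel version this post examines.

/* proto_sketch.c: illustrative kernel module registering a net_protocol handler. */
#include <linux/module.h>
#include <linux/skbuff.h>
#include <net/protocol.h>

#define DEMO_IPPROTO 253   /* assumed; reserved for experimentation */

static int demo_rcv(struct sk_buff *skb)
{
    /* A real handler would deliver data to sockets; this sketch just logs and frees. */
    pr_info("demo proto: received %u byte packet\n", skb->len);
    kfree_skb(skb);
    return 0;
}

static const struct net_protocol demo_protocol = {
    .handler   = demo_rcv,
    .no_policy = 1,
    .netns_ok  = 1,
};

static int __init demo_init(void)
{
    /* Same call the IP stack uses above to hook TCP, UDP, and ICMP in. */
    return inet_add_protocol(&demo_protocol, DEMO_IPPROTO);
}

static void __exit demo_exit(void)
{
    inet_del_protocol(&demo_protocol, DEMO_IPPROTO);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");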
We’re going to be looking at the UDP protocol layer. As seen above, the
handler
function for UDP is called udp_rcv
.This is the entry point into the UDP layer, where the IP layer hands off data. Let’s continue our journey there.
UDP protocol layer
The code for the UDP protocol layer can be found in: net/ipv4/udp.c.
udp_rcv
The code for the
udp_rcv
function is just a single line which calls directly into __udp4_lib_rcv
to handle receiving the datagram.__udp4_lib_rcv
The
__udp4_lib_rcv
function will check to ensure the packet is valid and obtain the UDP header, UDP datagram length, source address, and destination address. Next, are some additional integrity checks and checksum verification.Recall that earlier in the IP protocol layer, we saw that an optimization is performed to attach a
dst_entry
to the packet before it is handed off to the upper layer protocol (UDP in our case).If a socket and corresponding
dst_entry
were found, __udp4_lib_rcv
will queue the packet to the socket:sk = skb_steal_sock(skb); if (sk) { struct dst_entry *dst = skb_dst(skb); int ret; if (unlikely(sk->sk_rx_dst != dst)) udp_sk_rx_dst_set(sk, dst); ret = udp_queue_rcv_skb(sk, skb); sock_put(sk); /* a return value > 0 means to resubmit the input, but * it wants the return to be -protocol, or 0 */ if (ret > 0) return -ret; return 0; } else {
If there is no socket attached from the early_demux operation, a receiving socket will now be looked up by calling
__udp4_lib_lookup_skb
.In both cases described above, the datagram will be queued to the socket:
ret = udp_queue_rcv_skb(sk, skb); sock_put(sk);
If no socket was found, the datagram will be dropped:
/* No socket. Drop packet silently, if checksum is wrong */ if (udp_lib_checksum_complete(skb)) goto csum_error; UDP_INC_STATS_BH(net, UDP_MIB_NOPORTS, proto == IPPROTO_UDPLITE); icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0); /* * Hmm. We got an UDP packet to a port to which we * don't wanna listen. Ignore it. */ kfree_skb(skb); return 0;
udp_queue_rcv_skb
The initial parts of this function are as follows:
- Determine if the socket associated with the datagram is an encapsulation socket. If so, pass the packet up to that layer’s handler function before proceeding.
- Determine if the datagram is a UDP-Lite datagram and do some integrity checks.
- Verify the UDP checksum of the datagram and drop it if the checksum fails.
Finally, we arrive at the receive queue logic which begins by checking if the receive queue for the socket is full. From
net/ipv4/udp.c
:if (sk_rcvqueues_full(sk, skb, sk->sk_rcvbuf)) goto drop;
sk_rcvqueues_full
The
sk_rcvqueues_full
function checks the socket’s backlog length and the socket’s sk_rmem_alloc
to determine if the sum is greater than the sk_rcvbuf
for the socket (sk->sk_rcvbuf
in the above code snippet):/* * Take into account size of receive queue and backlog queue * Do not take into account this skb truesize, * to allow even a single big packet to come. */ static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb, unsigned int limit) { unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc); return qsize > limit; }
Tuning these values is a bit tricky as there are many things that can be adjusted.
Tuning: Socket receive queue memory
The
sk->sk_rcvbuf
(called limit in sk_rcvqueues_full
above) value can be increased to whatever the sysctl net.core.rmem_max
is set to.Increase the maximum receive buffer size by setting a
sysctl
.$ sudo sysctl -w net.core.rmem_max=8388608
sk->sk_rcvbuf
starts at the net.core.rmem_default
value, which can also be adjusted by setting a sysctl, like so:Adjust the default initial receive buffer size by setting a
sysctl
.$ sudo sysctl -w net.core.rmem_default=8388608
You can also set the
sk->sk_rcvbuf
size by calling setsockopt
from your application and passing SO_RCVBUF
. The maximum you can set with setsockopt
is net.core.rmem_max
.However, you can override the
net.core.rmem_max
limit by calling setsockopt
and passing SO_RCVBUFFORCE
, but the user running the application will need the CAP_NET_ADMIN
capability.
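To connect the per-socket limit to application code, here is a minimal sketch (mine, not from the post) that requests a larger receive buffer. SO_RCVBUF is capped at net.core.rmem_max, SO_RCVBUFFORCE can exceed the cap with CAP_NET_ADMIN, and the kernel doubles the value you pass to leave room for bookkeeping overhead, so the value read back will be larger than the value requested.

/* rcvbuf_sketch.c: request a larger socket receive buffer. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int requested = 4 * 1024 * 1024;   /* 4 MiB, an assumed value for illustration */

    /* Capped by net.core.rmem_max; use SO_RCVBUFFORCE (needs CAP_NET_ADMIN) to exceed the cap. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0)
        perror("setsockopt(SO_RCVBUF)");

    int actual = 0;
    socklen_t len = sizeof(actual);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);
    printf("requested %d bytes, kernel set sk_rcvbuf to %d\n", requested, actual);

    close(fd);
    return 0;
}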
The sk->sk_rmem_alloc
value is incremented by calls to skb_set_owner_r
which set the owner socket of a datagram. We’ll see this called later in the UDP layer.The
sk->sk_backlog.len
is incremented by calls to sk_add_backlog
, which we’ll see next.udp_queue_rcv_skb
Once it’s been verified that the queue is not full, progress toward queuing the datagram can continue. From net/ipv4/udp.c:
bh_lock_sock(sk); if (!sock_owned_by_user(sk)) rc = __udp_queue_rcv_skb(sk, skb); else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) { bh_unlock_sock(sk); goto drop; } bh_unlock_sock(sk); return rc;
The first step is to determine whether the socket currently has any system calls against it from a userland program. If it does not, the datagram can be added to the receive queue with a call to
__udp_queue_rcv_skb
. If it does, the datagram is queued to the backlog with a call to sk_add_backlog
.The datagrams on the backlog are added to the receive queue when socket system calls release the socket with a call to
release_sock
in the kernel.__udp_queue_rcv_skb
The
__udp_queue_rcv_skb
function adds datagrams to the receive queue by calling sock_queue_rcv_skb
and bumps statistics counters if the datagram could not be added to the receive queue for the socket.From net/ipv4/udp.c:
rc = sock_queue_rcv_skb(sk, skb); if (rc < 0) { int is_udplite = IS_UDPLITE(sk); /* Note that an ENOMEM error is charged twice */ if (rc == -ENOMEM) UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_RCVBUFERRORS,is_udplite); UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite); kfree_skb(skb); trace_udp_fail_queue_rcv_skb(rc, sk); return -1; }
Monitoring: UDP protocol layer statistics
Two very useful files for getting UDP protocol statistics are:
/proc/net/snmp
/proc/net/udp
/proc/net/snmp
Monitor detailed UDP protocol statistics by reading
/proc/net/snmp
.$ cat /proc/net/snmp | grep Udp\: Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors Udp: 16314 0 0 17161 0 0
Much like the detailed statistics found in this file for the IP protocol, you will need to read the protocol layer source to determine exactly when and where these values are incremented.
InDatagrams
: Incremented whenrecvmsg
was used by a userland program to read datagram. Also incremented when a UDP packet is encapsulated and sent back for processing.
NoPorts
: Incremented when UDP packets arrive destined for a port where no program is listening.
InErrors
: Incremented in several cases: no memory in the receive queue, when a bad checksum is seen, and ifsk_add_backlog
fails to add the datagram.
OutDatagrams
: Incremented when a UDP packet is handed down without error to the IP protocol layer to be sent.
RcvbufErrors
: Incremented whensock_queue_rcv_skb
reports that no memory is available; this happens ifsk->sk_rmem_alloc
is greater than or equal tosk->sk_rcvbuf
.
SndbufErrors
: Incremented if the IP protocol layer reported an error when trying to send the packet and no error queue has been setup. Also incremented if no send queue space or kernel memory are available.
InCsumErrors
: Incremented when a UDP checksum failure is detected. Note that in all cases I could find, InCsumErrors is incremented at the same time as InErrors. Thus, InErrors - InCsumErrors should yield the count of memory-related errors on the receive side.
/proc/net/udp
Monitor UDP socket statistics by reading
/proc/net/udp
$ cat /proc/net/udp sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops 515: 00000000:B346 00000000:0000 07 00000000:00000000 00:00000000 00000000 104 0 7518 2 0000000000000000 0 558: 00000000:0371 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7408 2 0000000000000000 0 588: 0100007F:038F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7511 2 0000000000000000 0 769: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7673 2 0000000000000000 0 812: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7407 2 0000000000000000 0
The first line describes each of the fields in the lines following:
sl
: Kernel hash slot for the socket
local_address
: Hexadecimal local address of the socket and port number, separated by:
.
rem_address
: Hexadecimal remote address of the socket and port number, separated by:
.
st
: The state of the socket. Oddly enough, the UDP protocol layer seems to use some TCP socket states. In the example above,7
isTCP_CLOSE
.
tx_queue
: The amount of memory allocated in the kernel for outgoing UDP datagrams.
rx_queue
: The amount of memory allocated in the kernel for incoming UDP datagrams.
tr
,tm->when
,retrnsmt
: These fields are unused by the UDP protocol layer.
uid
: The effective user id of the user who created this socket.
timeout
: Unused by the UDP protocol layer.
inode
: The inode number corresponding to this socket. You can use this to help you determine which user process has this socket open. Check/proc/[pid]/fd
, which will contain symlinks tosocket[:inode]
.
ref
: The current reference count for the socket.
pointer
: The memory address in the kernel of thestruct sock
.
drops
: The number of datagram drops associated with this socket. Note that this does not include any drops related to sending datagrams (on corked UDP sockets or otherwise); this is only incremented in receive paths as of the kernel version examined by this blog post.
The code which outputs this can be found in
net/ipv4/udp.c
.Queuing data to a socket
Network data is queued to a socket with a call to
sock_queue_rcv_skb
. This function does a few things before adding the datagram to the queue:- The socket’s allocated memory is checked to determine if it has exceeded the receive buffer size. If so, the drop count for the socket is incremented.
- Next
sk_filter
is used to process any Berkeley Packet Filter filters that have been applied to the socket.
sk_rmem_schedule
is run to ensure sufficient receive buffer space exists to accept this datagram.
- Next the size of the datagram is charged to the socket with a call to
skb_set_owner_r
. This incrementssk->sk_rmem_alloc
.
- The data is added to the queue with a call to
__skb_queue_tail
.
- Finally, any processes waiting on data to arrive in the socket are notified with a call to the
sk_data_ready
notification handler function.
And that is how data arrives at a system and traverses the network stack until it reaches a socket and is ready to be read by a user program.
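To close the loop from the kernel’s receive queue to a user program, here is a minimal UDP receiver sketch (my own illustration, not from the post). The recvfrom call is what ultimately drains the socket receive queue that udp_queue_rcv_skb and sock_queue_rcv_skb fill; the port number 12345 is an arbitrary assumption.

/* udp_recv_sketch.c: minimal receiver that drains a UDP socket's receive queue. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(12345);            /* assumed port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }

    char buf[2048];
    for (;;) {
        struct sockaddr_in peer;
        socklen_t peer_len = sizeof(peer);
        /* Blocks until a datagram is moved off the socket's receive queue. */
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, (struct sockaddr *)&peer, &peer_len);
        if (n < 0) { perror("recvfrom"); break; }
        printf("received %zd bytes from %s\n", n, inet_ntoa(peer.sin_addr));
    }
    close(fd);
    return 0;
}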
Extras
There are a few extra things worth mentioning that didn’t seem to fit anywhere else.
Timestamping
As mentioned in the above blog post, the networking stack can collect timestamps of incoming data. There are sysctl values controlling when/how to collect timestamps when used in conjunction with RPS; see the above post for more information on RPS, timestamping, and where, exactly, in the network stack receive timestamping happens. Some NICs even support timestamping in hardware, too.
This is a useful feature if you’d like to try to determine how much latency the kernel network stack is adding to receiving packets.
The kernel documentation about timestamping is excellent and there is even an included sample program and Makefile you can check out!
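As a hedged illustration of software receive timestamping (the socket options used are standard, but the snippet is mine and not the kernel’s sample program), the sketch below enables SO_TIMESTAMPNS on a UDP socket and reads the timestamp out of the control message delivered with each datagram. The port number is an assumption.

/* ts_sketch.c: read software receive timestamps via SO_TIMESTAMPNS. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SO_TIMESTAMPNS
#define SO_TIMESTAMPNS 35            /* value from asm-generic/socket.h */
#endif
#ifndef SCM_TIMESTAMPNS
#define SCM_TIMESTAMPNS SO_TIMESTAMPNS
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int on = 1;
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(12345);    /* assumed port */

    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("setup"); return 1; }
    /* Ask the kernel to attach a struct timespec to every received datagram. */
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPNS, &on, sizeof(on));

    char data[2048], ctrl[512];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };

    if (recvmsg(fd, &msg, 0) < 0) { perror("recvmsg"); return 1; }

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMPNS) {
            struct timespec ts;
            memcpy(&ts, CMSG_DATA(c), sizeof(ts));
            printf("received at %ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
        }
    }
    close(fd);
    return 0;
}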
Determine which timestamp modes your driver and device support with
ethtool -T
.$ sudo ethtool -T eth0 Time stamping parameters for eth0: Capabilities: software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE) software-receive (SOF_TIMESTAMPING_RX_SOFTWARE) software-system-clock (SOF_TIMESTAMPING_SOFTWARE) PTP Hardware Clock: none Hardware Transmit Timestamp Modes: none Hardware Receive Filter Modes: none
This NIC, unfortunately, does not support hardware receive timestamping, but software timestamping can still be used on this system to help me determine how much latency the kernel is adding to my packet receive path.
Busy polling for low latency sockets
It is possible to use a socket option called
SO_BUSY_POLL
which will cause the kernel to busy poll for new data when a blocking receive is done and there is no data.IMPORTANT NOTE: For this option to work, your device driver must support it. Linux kernel 3.13.0’s
igb
driver does not support this option. The ixgbe
driver, however, does. If your driver has a function set to the ndo_busy_poll
field of its struct net_device_ops
structure (mentioned in the above blog post), it supports SO_BUSY_POLL
.A great paper explaining how this works and how to use it is available from Intel.
When using this socket option for a single socket, you should pass a time value in microseconds as the amount of time to busy poll in the device driver’s receive queue for new data. When you issue a blocking read to this socket after setting this value, the kernel will busy poll for new data.
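Here is a minimal sketch (mine, not from the post) of opting a single socket into busy polling, assuming a driver that supports it as noted above; the 50 microsecond value is an arbitrary example, and increasing the value may require CAP_NET_ADMIN.

/* busy_poll_sketch.c: opt a single socket into busy polling. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46   /* value from asm-generic/socket.h; present since Linux 3.11 */
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int busy_poll_usecs = 50;   /* assumed value; microseconds to busy poll on blocking reads */
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_poll_usecs, sizeof(busy_poll_usecs)) < 0)
        perror("setsockopt(SO_BUSY_POLL)");   /* needs a new enough kernel; raising the value may need CAP_NET_ADMIN */

    /* Blocking reads on this socket may now busy poll the driver's receive queue for up to 50us. */
    close(fd);
    return 0;
}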
You can also set the sysctl value
net.core.busy_poll
to a time value in microseconds of how long calls with poll
or select
should busy poll waiting for new data to arrive, as well.This option can reduce latency, but will increase CPU usage and power consumption.
Netpoll: support for networking in critical contexts
The Linux kernel provides a way for device drivers to be used to send and receive data on a NIC when the kernel has crashed. The API for this is called Netpoll and it is used by a few things, but most notably: kgdb, netconsole.
Most drivers support Netpoll; your driver needs to implement the
ndo_poll_controller
function and attach it to the struct net_device_ops
that is registered during probe (as seen above).When the networking device subsystem performs operations on incoming or outgoing data, the netpoll system is checked first to determine if the packet is destined for netpoll.
For example, we can see the following code in
__netif_receive_skb_core
from net/core/dev.c
:static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc) { /* ... */ /* if we've gotten here through NAPI, check netpoll */ if (netpoll_receive_skb(skb)) goto out; /* ... */ }
The Netpoll checks happen early in most of the Linux network device subsystem code that deals with transmitting or receiving network data.
Consumers of the Netpoll API can register
struct netpoll
structures by calling netpoll_setup
. The struct netpoll
structure has function pointers for attaching receive hooks, and the API exports a function for sending data.If you are interested in using the Netpoll API, you should take a look at the
netconsole
driver, the Netpoll API header file, include/linux/netpoll.h, and this excellent talk.SO_INCOMING_CPU
The
SO_INCOMING_CPU
flag was not added until Linux 3.19, but it is useful enough that it should be included in this blog post.You can use
getsockopt
with the SO_INCOMING_CPU
option to determine which CPU is processing network packets for a particular socket. Your application can then use this information to hand sockets off to threads running on the desired CPU to help increase data locality and CPU cache hits.The mailing list message introducing
SO_INCOMING_CPU
provides a short example architecture where this option is useful.
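Here is a hedged sketch (mine, not from the mailing list message) of querying the option; it requires Linux 3.19 or newer, and the value is only meaningful once packets have actually been processed for the socket.

/* incoming_cpu_sketch.c: ask which CPU the kernel is processing this socket's packets on. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49   /* value from asm-generic/socket.h; present since Linux 3.19 */
#endif

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    /* In a real program you would call this after traffic has arrived on a bound or
     * connected socket; the answer only becomes meaningful once packets have been processed. */
    int cpu = -1;
    socklen_t len = sizeof(cpu);
    if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0)
        perror("getsockopt(SO_INCOMING_CPU)");   /* fails on kernels older than 3.19 */
    else
        printf("last packet for this socket was processed on CPU %d\n", cpu);

    close(fd);
    return 0;
}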
DMA Engines
A DMA engine is a piece of hardware that allows the CPU to offload large copy operations. This frees the CPU to do other tasks while memory copies are done with hardware. Enabling the use of a DMA engine and running code that takes advantage of it should yield reduced CPU usage.
The Linux kernel has a generic DMA engine interface that DMA engine driver authors can plug into. You can read more about the Linux DMA engine interface in the kernel source Documentation.
While there are a few DMA engines that the kernel supports, we’re going to discuss one in particular that is quite common: the Intel IOAT DMA engine.
Intel’s I/O Acceleration Technology (IOAT)
Many servers include the Intel I/O AT bundle, which comprises a series of performance changes.
One of those changes is the inclusion of a hardware DMA engine. You can check your
dmesg
output for ioatdma
to determine if the module is being loaded and if it has found supported hardware.The DMA offload engine is used in a few places, most notably in the TCP stack.
Support for the Intel IOAT DMA engine was included in Linux 2.6.18, but was disabled later in 3.13.11.10 due to some unfortunate data corruption bugs.
Users on kernels before 3.13.11.10 may be using the
ioatdma
module on their servers by default. Perhaps this will be fixed in future kernel releases.Direct cache access (DCA)
Another interesting feature included with the Intel I/O AT bundle is Direct Cache Access (DCA).
This feature allows network devices (via their drivers) to place network data directly in the CPU cache. How this works, exactly, is driver specific. For the
igb
driver, you can check the code for the function igb_update_dca
, as well as the code for igb_update_rx_dca
. The igb
driver uses DCA by writing a register value to the NIC.To use DCA, you will need to ensure that DCA is enabled in your BIOS, the
dca
module is loaded, and that your network card and driver both support DCA.Create a secure APT repository in less than 10 seconds, free.
Monitoring IOAT DMA engine
If you are using the
ioatdma
module despite the risk of data corruption mentioned above, you can monitor it by examining some entries in sysfs
.Monitor the total number of offloaded
memcpy
operations for a DMA channel.$ cat /sys/class/dma/dma0chan0/memcpy_count 123205655
Similarly, to get the number of bytes offloaded by this DMA channel, you’d run a command like:
Monitor total number of bytes transferred for a DMA channel.
$ cat /sys/class/dma/dma0chan0/bytes_transferred 131791916307
Tuning IOAT DMA engine
The IOAT DMA engine is only used when packet size is above a certain threshold. That threshold is called the
copybreak
. This check is in place because for small copies, the overhead of setting up and using the DMA engine is not worth the accelerated transfer.Adjust the DMA engine copybreak with a
sysctl
.$ sudo sysctl -w net.ipv4.tcp_dma_copybreak=2048
The default value is 4096.
Conclusion
The Linux networking stack is complicated.
It is impossible to monitor or tune it (or any other complex piece of software) without understanding at a deep level exactly what’s going on. Often, out in the wild of the Internet, you may stumble across a sample
sysctl.conf
that contains a set of sysctl values that should be copied and pasted onto your computer. This is probably not the best way to optimize your networking stack.Monitoring the networking stack requires careful accounting of network data at every layer, starting with the drivers and proceeding up. That way you can determine where exactly drops and errors are occurring and then adjust settings to determine how to reduce the errors you are seeing.
There is, unfortunately, no easy way out.
- Author: tangcuyu
- Link: https://expoli.tech/articles/2025/04/03/Monitoring%20and%20Tuning%20the%20Linux%20Networking%20Stack%3A%20Receiving%20Data%20%7C%20Packagecloud%20Blog
- License: This article is licensed under CC BY-NC-SA 4.0. Please credit the source when republishing.